JBOF - A Dataset from Just a Bunch of Files

Many datasets consist of large collections of disjointed files, often with various levels of metadata. Managing such a dataset by hand is finicky and error-prone. JBOF standardizes the creation of, and access to, such datasets.

JBOF is available for Python and Matlab (read-only). It includes a partial implementation of Numpy’s NPY format for Matlab.

JBOF is free software under the terms of the GPL v3 license.

Design Notes

Datasets are commonly stored either as loose collections of data files and metadata, or as monolithic archives such as HDF or MAT files. The former are hard to work with, since every dataset has its own formatting conventions and file encodings. The latter are hard to explore without thorough documentation, and often significantly slower to access than plain files.

Specifically, I was struggling with HDF files full of audio data and audio-related features. Since each archive is a single file, it was hard for multiple processes to access it concurrently. Due to its internal structure, metadata access was unnecessarily slow, which made searching the archive a chore. Furthermore, audio data compresses well, but no audio-specific compression could be used within HDF.

JBOF solves this problem for me by storing audio data and metadata as audio and JSON files in a clean, HDF-like directory structure. This structure is easy to explore with a file browser, an audio player, and a text editor, and fast to access from any program that can read JSON and audio files. For non-audio data, other file formats such as NPY and MAT are supported as well.

A JBOF DataSet is organized as a flat list of Items, each of which contains Arrays. The DataSet, its Items, and its Arrays can all have metadata attached, which is stored as simple JSON files. Every action is thread-safe, in the sense that adding or deleting an Item or Array only ever touches files within that Item or Array, and therefore does not interfere with other concurrent actions.

Sometimes, it is beneficial to save a DataSet as a single file. To enable this, DataSets can be exported as HDF and ZIP files. These HDF/ZIP files can still be opened as DataSets, albeit only in read-only mode. JBOF provides convenience methods for converting DataSets to/from HDF; to convert to/from ZIP, simply zip/unzip the directory. Note that HDF datasets are significantly slower than plain files, and often bigger, due to the lack of audio-specific file encodings, and both ZIP and HDF files may be slower to read from multiple processes.
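For example, a round trip through HDF might look like this (a minimal sketch; the file and directory names are placeholders, and I assume that hdf_to_dataset takes the HDF file's name):

>>> import jbof
>>> dataset = jbof.DataSet('dataset')
# export the dataset to a single HDF5 file:
>>> jbof.dataset_to_hdf(dataset, 'dataset.hdf')
# convert the HDF5 file back into a plain-file dataset:
>>> jbof.hdf_to_dataset('dataset.hdf', 'restored_dataset')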

Structure

In JBOF, a dataset consists of many Items, each of which may contain many Arrays. The dataset, each item, and each array can have arbitrary metadata, serialized as .json files. Arrays are Numpy arrays, stored as .npy files or in various other file formats (.mat, .wav, .ogg, .flac).

On disk, a dataset in JBOF might look like this:

dataset
├── __init__.py
├── _metadata.json
├── item1
│   ├── _metadata.json
│   ├── array1.json
│   ├── array1.npy
│   ├── array2.json
│   └── array2.npy
└── item2
    ├── _metadata.json
    ├── array1.json
    └── array1.npy

This structure is meant to be both simple to parse, and human-readable.

  • The dataset directory contains its own metadata as _metadata.json, and sub-directories for each item.

  • Each item directory contains its own metadata as _metadata.json, and pairs of files for each array.

  • Each array is a pair of files: one for the metadata (array.json), and one containing the actual data (array.npy).

By making sure that each operation on the dataset and items is atomic, multiple processes can read/write the dataset concurrently without fear of data corruption or race conditions.
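For example, several worker processes can populate the same dataset at once. The following is a minimal sketch, assuming a writeable dataset already exists at 'dataset' (the worker function, item names, and data are illustrative):

import jbof
from multiprocessing import Pool

def add_one_item(name):
    # every worker opens its own handle to the same dataset
    dataset = jbof.DataSet('dataset', readonly=False)
    item = dataset.add_item(name=name, metadata={'worker': name})
    item.add_array('data', [1, 2, 3])

if __name__ == '__main__':
    # each add_item only touches files inside its own item directory,
    # so the workers do not interfere with one another
    with Pool(4) as pool:
        pool.map(add_one_item, ['a', 'b', 'c', 'd'])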

Reading Data

To load a dataset, you can either import it, or load it from a URI:

# either:
>>> from dataset import dataset
# or
>>> import jbof
>>> dataset = jbof.DataSet('dataset')

When opening a dataset like this, it is read-only (set readonly=False to make it writeable).
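For example, to open the same dataset writeable:

>>> dataset = jbof.DataSet('dataset', readonly=False)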

Access the dataset’s metadata:

# get all metadata:
>>> dataset.metadata
{'license': 'GPL v3'}

Metadata is stored as a JSON file, and can be any JSON-serializable data.

Access the dataset’s items and items’ metadata:

>>> for item in dataset.all_items():
...     print(item.metadata)
{'timestamp': '2018-01-25 15:20'}
{'timestamp': '2018-01-25 15:25'}

You can search for items that match criteria:

>>> for item in dataset.find_items(timestamp='2018-01-25 15:20'):
...     print(item.metadata)
{'timestamp': '2018-01-25 15:20'}

There are a few more search options: you can provide a list of valid matches for a key, or fetch only a single matching item with find_one_item, as shown below.
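For example, building on the two items from above:

# match any of several metadata values:
>>> for item in dataset.find_items(timestamp=['2018-01-25 15:20', '2018-01-25 15:25']):
...     print(item.metadata)
{'timestamp': '2018-01-25 15:20'}
{'timestamp': '2018-01-25 15:25'}
# return a single item instead of a generator:
>>> dataset.find_one_item(timestamp='2018-01-25 15:20').metadata
{'timestamp': '2018-01-25 15:20'}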

Access each item’s arrays:

# either use `all_arrays`:
>>> for name, array in item.all_arrays():
...     print(name, array.metadata, array)
array1 {'timestamp': '2018-01-25 15:20'} [numpy.ndarray data]
# or access a single array:
>>> item.array1.metadata
{'timestamp': '2018-01-25 15:20'}
>>> item.array1
[numpy.ndarray data]

Writing Data

Create a new, writeable dataset:

>>> import jbof
>>> dataset = jbof.create_dataset('new_dataset', metadata={...})

Then, add items and data:

>>> item = dataset.add_item(name="...", metadata={...})
>>> item.add_array('array1', [your data], metadata={...})
>>> item.add_array('array2', [your data], metadata={...})

If you don’t name items, they are assigned random UUIDs as names. Alternatively, you can supply an itemformat to the DataSet, which will generate item names from the item metadata, as in the example below.
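A minimal sketch of itemformat (the 'speaker' metadata key and its value are illustrative):

>>> dataset = jbof.create_dataset('new_dataset', itemformat='{speaker}')
>>> item = dataset.add_item(metadata={'speaker': 'marc'})
>>> item.name
'marc'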

You can delete arrays and items with Item.delete_array and DataSet.delete_item, and the whole dataset with jbof.delete_dataset.
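For example (note that these take the array, item, and dataset objects themselves, not their names):

>>> item.delete_array(item.array1)
>>> dataset.delete_item(item)
>>> jbof.delete_dataset(dataset)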

API Documentation

jbof.delete_dataset(dataset)

Deletes the whole dataset permanently from the hard drive.

jbof.create_dataset(directory, metadata=None, itemformat=None)

Create a new dataset.

  • metadata must be JSON-serializable

  • itemformat is a str.format string, to be called with an item’s metadata to create the item’s directory name. This is useful for creating human-readable directories.

class jbof.DataSet(directory, readonly=True)

A structured collection of items that contain data.

add_item(name=None, metadata=None)

Add a new, empty item.

all_items()

A generator that returns all items.

calculate_hash()

Calculates an MD5 hash of all data.

delete_item(item)

Deletes item permanently from the hard drive.

This invalidates the find_items cache.

find_items(**query)

Search for items that match query.

Query can be arbitrary keyword arguments that are matched in the metadata.

  • if the query is a string, e.g. foo='bar', the metadata must contain foo='bar'.

  • if the query is a list of strings, e.g. foo=['bar', 'baz'], the metadata must contain either foo='bar' or foo='baz'.

Returns a generator that walks all matching items.

If there are many items, the first run might be slow, but subsequent find_items will search through cached metadata, and will run faster.

find_one_item(**query)

Search for items that match query.

See find_items for details.

get_item(name)

Get an item by name.

has_item(name)

Check if item of name exists.

Or use name in dataset instead.

property itemformat

The str.format format string for generating item names from metadata.

property metadata

The Dataset’s metadata dict.

property name

The dataset’s directory name.

class jbof.Item(directory, readonly)

A collection of arrays in a dataset.

add_array(name, data, metadata=None, fileformat='npy', samplerate=None)

Add a new array.

  • name is the name of the array.

  • data must be numpy-serializable.

  • metadata must be JSON-serializable.

  • fileformat must be one of ['npy', 'wav', 'flac', 'ogg', 'mat']. If fileformat is 'wav', 'flac', or 'ogg', samplerate must be given (see the example below).
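A minimal sketch of adding audio data (the array name, data, and samplerate are illustrative):

>>> import numpy
>>> noise = numpy.random.uniform(-0.5, 0.5, 44100)
>>> item.add_array('noise', noise, metadata={'kind': 'white noise'},
...                fileformat='wav', samplerate=44100)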

add_array_from_file(name, filename, metadata=None)

Add a new array from an existing file.

  • name is the name of the array.

  • filename is the file to be added (must be one of .npy, .wav, .flac, .ogg, .mat)

  • metadata must be JSON-serializable.
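For example (the file name is a placeholder):

>>> item.add_array_from_file('recording', 'recording.flac', metadata={'source': 'microphone'})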

all_arrays()

A generator that returns all arrays as name-value pairs.

delete_array(array)

Deletes array permanently from the hard drive.

has_array(name)

Check if array of name exists.

Or use name in item instead.

property metadata

The item’s metadata dict.

property name

The item’s directory name.

class jbof.Array(metafile)

A subclass of numpy.ndarray with a _filename and metadata.

metadata

The Array’s metadata dict.

_filename

The Array’s file name.

jbof.dataset_to_hdf(dataset, hdffilename)

Convert the dataset to HDF5.

jbof.hdf_to_dataset(hdfdataset, directory)

Convert an HDF5 dataset to a JBOF directory.
