Remotely-Sourced Zoo Datasets

This page describes how to work with and create zoo datasets whose download/preparation methods are hosted via GitHub repositories or public URLs.

Note

To download from a private GitHub repository that you have access to, provide your GitHub personal access token by setting the GITHUB_TOKEN environment variable.

Working with remotely-sourced datasets

Working with remotely-sourced zoo datasets is just like working with built-in zoo datasets: both varieties support the full zoo API.

When specifying remote sources, you can provide any of the following:

  • A GitHub repo URL like https://github.com/<user>/<repo>

  • A GitHub ref like https://github.com/<user>/<repo>/tree/<branch> or https://github.com/<user>/<repo>/commit/<commit>

  • A GitHub ref string like <user>/<repo>[/<ref>]

  • A publicly accessible URL of an archive (eg zip or tar) file

Here’s the basic recipe for working with remotely-sourced zoo datasets:
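As a minimal sketch, using the voxel51/coco-2017 repository referenced at the end of this page as an illustrative source (and assuming its declared name matches the repo):

import fiftyone as fo
import fiftyone.zoo as foz

# Download and load a remote dataset by providing its GitHub URL
dataset = foz.load_zoo_dataset(
    "https://github.com/voxel51/coco-2017",
    split="validation",
)

# The dataset now also appears under its declared name...
print(foz.list_zoo_datasets())

# ...so subsequent loads can reference it by name
dataset = foz.load_zoo_dataset("voxel51/coco-2017", split="validation")

session = fo.launch_app(dataset)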

Creating remotely-sourced datasets

A remotely-sourced dataset is defined by a directory with the following contents:

fiftyone.yml
__init__.py
    def download_and_prepare(dataset_dir, split=None, **kwargs):
        pass

    def load_dataset(dataset, dataset_dir, split=None, **kwargs):
        pass

Each component is described in detail below.

Note

By convention, datasets also contain an optional README.md file that provides additional information about the dataset and example syntaxes for downloading and working with it.

fiftyone.yml

The dataset’s fiftyone.yml or fiftyone.yaml file defines relevant metadata about the dataset:

| Field | Required? | Description |
| --- | --- | --- |
| name | yes | The name of the dataset. Once you've downloaded all or part of a remotely-sourced zoo dataset, it will subsequently appear as an available zoo dataset under this name when using the zoo API |
| type | | Declare that the directory defines a dataset. This can be omitted for backwards compatibility, but it is recommended to specify this |
| author | | The author of the dataset |
| version | | The version of the dataset |
| url | | The source (eg GitHub repository) where the directory containing this file is hosted |
| source | | The original source of the dataset |
| license | | The license under which the dataset is distributed |
| description | | A brief description of the dataset |
| fiftyone.version | | A semver version specifier (or *) describing the required FiftyOne version for the dataset to load properly |
| supports_partial_downloads | | Specify true or false whether parts of the dataset can be downloaded/loaded by providing kwargs to download_zoo_dataset() or load_zoo_dataset(), as described in Partial downloads below. If omitted, this is assumed to be false |
| tags | | A list of tags for the dataset. Useful in conjunction with list_zoo_datasets() |
| splits | | A list of the dataset's supported splits. This should be omitted if the dataset does not contain splits |
| size_samples | | The total number of samples in the dataset, or a list of per-split sizes |

Here are two example dataset YAML files:
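The field values below are illustrative sketches rather than verbatim files (the author, license, sizes, and the nesting of the fiftyone.version specifier are all assumptions). The first describes a simple dataset without splits; the second describes a split dataset that supports partial downloads:

name: <user>/my-dataset
type: dataset
author: Jane Developer
version: 1.0.0
url: https://github.com/<user>/my-dataset
source: https://example.com/original-source
license: CC-BY-4.0
description: A small example image classification dataset
fiftyone:
  version: "*"
tags:
  - image
  - classification
size_samples: 1000

name: <user>/my-split-dataset
type: dataset
author: Jane Developer
version: 1.0.0
url: https://github.com/<user>/my-split-dataset
source: https://example.com/original-source
license: CC-BY-4.0
description: An example detection dataset with train/validation/test splits
fiftyone:
  version: "*"
supports_partial_downloads: true
tags:
  - image
  - detection
splits:
  - train
  - validation
  - test
size_samples:
  - train: 80000
  - validation: 10000
  - test: 10000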

Download and prepare

Every dataset's __init__.py file must define a download_and_prepare() method with the signature below:

def download_and_prepare(dataset_dir, split=None, **kwargs):
    """Downloads the dataset and prepares it for loading into FiftyOne.

    Args:
        dataset_dir: the directory in which to construct the dataset
        split (None): a specific split to download, if the dataset supports
            splits. The supported split values are defined by the dataset's
            YAML file
        **kwargs: optional keyword arguments that your dataset can define to
            configure what/how the download is performed

    Returns:
        a tuple of

        -   ``dataset_type``: a ``fiftyone.types.Dataset`` type that the
            dataset is stored in locally, or None if the dataset provides
            its own ``load_dataset()`` method
        -   ``num_samples``: the total number of downloaded samples for the
            dataset or split
        -   ``classes``: a list of classes in the dataset, or None if not
            applicable
    """

    # Download files and organize them in `dataset_dir`
    ...

    # Define how the data is stored: either a built-in dataset type...
    dataset_type = fo.types.ImageClassificationDirectoryTree

    # ...or None, if the dataset provides its own ``load_dataset()`` method
    dataset_type = None

    # Indicate how many samples have been downloaded
    # May be less than the total size if partial downloads have been used
    num_samples = 10000

    # Optionally report what classes exist in the dataset
    classes = None
    classes = ["cat", "dog", ...]

    return dataset_type, num_samples, classes

This method is called under the hood when a user calls download_zoo_dataset() or load_zoo_dataset(). Its job is to download any relevant files from the web and organize and/or prepare them as necessary into a format that's ready to be loaded into a FiftyOne dataset.

The dataset_type that download_and_prepare() returns defines how the dataset is ultimately loaded into FiftyOne:

  • Built-in importer: in many cases, FiftyOne already contains a built-in importer that can be leveraged to load data on disk into FiftyOne. Remotely-sourced datasets can take advantage of this by simply returning the appropriate dataset_type from download_and_prepare(), which is then used to load the data as follows (see the sketch after this list for a concrete example):
# If the dataset has splits, `dataset_dir` will be the split directory
dataset_importer_cls = dataset_type.get_dataset_importer_cls()
dataset_importer = dataset_importer_cls(dataset_dir=dataset_dir, **kwargs)

dataset.add_importer(dataset_importer, **kwargs)
  • Custom loader: if dataset_type=None is returned, then __init__.py must also contain a load_dataset() method as described below that handles loading the data into FiftyOne as follows:
load_dataset(dataset, dataset_dir, **kwargs)
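To make the built-in importer path concrete, here's a hedged sketch of a download_and_prepare() for a hypothetical classification dataset distributed as a zip archive whose contents are already organized as a <classname>/<image> directory tree; the URL and layout are assumptions, not part of the zoo contract:

import os
import urllib.request
import zipfile

import fiftyone as fo

def download_and_prepare(dataset_dir, split=None, **kwargs):
    # Hypothetical source archive; a real dataset would host its own files
    url = "https://example.com/my-dataset.zip"

    os.makedirs(dataset_dir, exist_ok=True)
    archive_path = os.path.join(dataset_dir, "tmp.zip")
    urllib.request.urlretrieve(url, archive_path)

    # Assume the archive already contains a `<classname>/<image>` tree,
    # which the built-in importer below understands
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(dataset_dir)

    os.remove(archive_path)

    # Report the classes (top-level directories) and sample count
    classes = sorted(
        d
        for d in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, d))
    )
    num_samples = sum(
        len(filenames) for _, _, filenames in os.walk(dataset_dir)
    )

    # Defer loading to FiftyOne's built-in importer for this layout
    dataset_type = fo.types.ImageClassificationDirectoryTree
    return dataset_type, num_samples, classes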

Load dataset

Datasets that don’t use a built-in importer must also define a load_dataset() method in their __init__.py with the signature below:

def load_dataset(dataset, dataset_dir, split=None, **kwargs):
    """Loads the dataset into the given FiftyOne dataset.

    Args:
        dataset: a :class:`fiftyone.core.dataset.Dataset` to which to import
        dataset_dir: the directory to which the dataset was downloaded
        split (None): a specific split to load, if the dataset supports
            splits. The supported split values are defined by the dataset's
            YAML file
        **kwargs: optional keyword arguments that your dataset can define to
            configure what/how the load is performed
    """

    # Load data into samples
    samples = [...]

    # Add samples to the dataset
    dataset.add_samples(samples)

This method’s job is to load the filepaths and any relevant labels into Sample objects and then call add_samples() or a similar method to add them to the provided Dataset.
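For instance, here's a hedged sketch of a custom loader for a hypothetical dataset whose images live in a data/ directory alongside a labels.json file mapping filenames to class names (this layout is an assumption for illustration only):

import json
import os

import fiftyone as fo

def load_dataset(dataset, dataset_dir, split=None, **kwargs):
    # Hypothetical layout: images in `data/` plus a `labels.json` that maps
    # each filename to its class label
    with open(os.path.join(dataset_dir, "labels.json"), "r") as f:
        labels = json.load(f)

    # Build one sample per image, attaching its classification label
    samples = [
        fo.Sample(
            filepath=os.path.join(dataset_dir, "data", filename),
            ground_truth=fo.Classification(label=label),
        )
        for filename, label in labels.items()
    ]

    dataset.add_samples(samples)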

Partial downloads

Remotely-sourced datasets can support partial downloads, which is useful for a variety of reasons, including:

  • A dataset may contain labels for multiple task types but the user is only interested in a subset of them

  • The dataset may be very large and the user only wants to download a small subset of the samples to get familiar with the dataset

Datasets that support partial downloads should declare this in their fiftyone.yml:

supports_partial_downloads: true

The partial download behavior itself is defined via **kwargs in the dataset’s __init__.py methods:

def download_and_prepare(dataset_dir, split=None, **kwargs):
    pass

def load_dataset(dataset, dataset_dir, split=None, **kwargs):
    pass

When download_zoo_dataset(url, ..., **kwargs) is called, any kwargs declared by download_and_prepare() are passed through to it.

When load_zoo_dataset(name_or_url, ..., **kwargs) is called, any kwargs declared by download_and_prepare() and load_dataset() are passed through to them, respectively.
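For example, assuming a dataset whose methods declare a max_samples kwarg (the kwarg name here is illustrative, not part of the zoo contract), a user could request a small partial download like so:

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "https://github.com/voxel51/coco-2017",
    split="validation",
    max_samples=50,  # assumes the dataset declares a `max_samples` kwarg
)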

Note

Check out voxel51/coco-2017 for an example of a remotely-sourced dataset that supports partial downloads.