FiftyOne Brain

The FiftyOne Brain provides powerful machine learning techniques that are designed to transform how you curate your data from an art into a measurable science.

Note

Did you know? You can execute Brain methods from the FiftyOne App by installing the @voxel51/brain plugin!

The FiftyOne Brain methods are useful across the stages of the machine learning workflow:

  • Visualizing embeddings: Tired of combing through individual images/videos and staring at aggregate performance metrics trying to figure out how to improve the performance of your model? Using FiftyOne to visualize your dataset in a low-dimensional embedding space can reveal patterns and clusters in your data that can help you answer many important questions about your data, from identifying the most critical failure modes of your model, to isolating examples of critical scenarios, to recommending new samples to add to your training dataset, and more!

  • Similarity: When constructing a dataset or training a model, have you ever wanted to find similar examples to an image or object of interest? For example, you may have found a failure case of your model and now want to search for similar scenarios in your evaluation set to diagnose the issue, or you want to mine your data lake to augment your training set to fix the issue. Use the FiftyOne Brain to index your data by similarity and you can easily query and sort your datasets to find similar examples, both programmatically and via point-and-click in the App.

  • Leaky splits: Often when sourcing data en masse, duplicates and near duplicates can slip through the cracks. The FiftyOne Brain offers a leaky splits analysis that can be used to find potential leaks between dataset splits. Such leaks can be misleading when evaluating a model, giving an overly optimistic measure for the quality of training.

  • Near duplicates: When curating massive datasets, you may inadvertently add near duplicate data to your datasets, which can bias or otherwise confuse your models. The FiftyOne Brain offers a near duplicate detection algorithm that automatically surfaces such data quality issues and prompts you to take action to resolve them.

  • Exact duplicates: Despite your best efforts, you may accidentally add duplicate data to a dataset. The FiftyOne Brain provides an exact duplicate detection method that scans your data and alerts you if a dataset contains duplicate samples, either under the same or different filenames.

  • Uniqueness: During the training loop for a model, the best results will be seen when training on unique data. The FiftyOne Brain provides a uniqueness measure for images that compares the content of every image in a dataset with all other images. Uniqueness operates on raw images and does not require any prior annotation on the data. It is hence very useful in the early stages of the machine learning workflow when you are likely asking “What data should I select to annotate?”

  • Mistakenness: Annotation mistakes create an artificial ceiling on the performance of your models. However, finding these mistakes by hand is at least as arduous as the original annotation was, especially in cases of larger datasets. The FiftyOne Brain provides a quantitative mistakenness measure to identify possible label mistakes. Mistakenness operates on labeled images and requires the logit-output of your model predictions in order to provide maximum efficacy. It also works on detection datasets to find missed objects, incorrect annotations, and localization issues.

  • Hardness: While a model is training, it will learn to understand attributes of certain samples faster than others. The FiftyOne Brain provides a hardness measure that calculates how easy or difficult it is for your model to understand any given sample. Mining hard samples is a tried and true practice in mature machine learning processes. Use your current model instance to compute predictions on unlabeled samples to determine which are the most valuable to have annotated and fed back into the system as training samples, for example.

  • Representativeness: When working with large datasets, it can be hard to determine what samples within it are outliers and which are more typical. The FiftyOne Brain offers a representativeness measure that can be used to find the most common types of images in your dataset. This is especially helpful to find easy examples to train on in your data and for visualizing common modes of the data.

Note

Check out the tutorials page for detailed examples demonstrating the use of many Brain capabilities.

Visualizing embeddings

The FiftyOne Brain provides a powerful compute_visualization() method that you can use to generate low-dimensional representations of the samples and/or individual objects in your datasets.

These representations can be visualized natively in the App’s Embeddings panel, where you can interactively select points of interest and view the corresponding samples/labels of interest in the Samples panel, and vice versa.

[Image: embeddings visualization of the MNIST dataset in the Embeddings panel]

There are two primary components to an embedding visualization: the method used to generate the embeddings, and the dimensionality reduction method used to compute a low-dimensional representation of the embeddings.

Embedding methods

The embeddings and model parameters of compute_visualization() support a variety of ways to generate embeddings for your data, as the sketch after this list illustrates:

  • Provide nothing, in which case a default general purpose model is used to embed your data

  • Provide a Model instance or the name of any model from the Model Zoo that supports embeddings

  • Provide your own precomputed embeddings in array form

  • Provide the name of a VectorField or ArrayField of your dataset in which precomputed embeddings are stored
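
For instance, here is a minimal sketch of the four options (the model name, field name, and brain keys are illustrative, and dataset is assumed to be a loaded dataset):

import fiftyone.brain as fob
import fiftyone.zoo as foz

# Option 1: use the default general purpose model
fob.compute_visualization(dataset, brain_key="default_viz")

# Option 2: specify a zoo model that supports embeddings
fob.compute_visualization(
    dataset, model="clip-vit-base32-torch", brain_key="model_viz"
)

# Option 3: provide precomputed embeddings in array form
model = foz.load_zoo_model("clip-vit-base32-torch")
embeddings = dataset.compute_embeddings(model)
fob.compute_visualization(dataset, embeddings=embeddings, brain_key="array_viz")

# Option 4: store embeddings in a field and reference it by name
dataset.compute_embeddings(model, embeddings_field="embeddings")
fob.compute_visualization(dataset, embeddings="embeddings", brain_key="field_viz")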

Dimensionality reduction methods

The method parameter of compute_visualization() allows you to specify the dimensionality reduction method to use. The supported methods are:

  • umap (default): Uniform Manifold Approximation and Projection (UMAP)

  • tsne: t-distributed Stochastic Neighbor Embedding (t-SNE)

  • pca: Principal Component Analysis (PCA)

  • manual: provide a manually computed low-dimensional representation

import fiftyone.brain as fob

results = fob.compute_visualization(
    dataset,
    method="umap",  # "umap", "tsne", "pca", etc
    brain_key="...",
    ...
)

Note

When you use the default UMAP method for the first time, you will be prompted to install the umap-learn package.
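
If you prefer, you can install it ahead of time:

pip install umap-learn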

Note

Refer to this section for more information about creating visualization runs.

Applications

How can embedding-based visualization of your data be used in practice? These visualizations often uncover hidden structure in your data that has important semantic meaning depending on the data you use to color/size the points.

Here are a few of the many possible applications:

  • Identifying anomalous and/or visually similar examples

  • Uncovering patterns in incorrect/spurious predictions

  • Finding examples of target scenarios in your data lake

  • Mining hard examples for your evaluation pipeline

  • Recommending samples from your data lake for classes that need additional training data

  • Unsupervised pre-annotation of training data

The best part about embedding visualizations is that you will likely discover more applications specific to your use case when you try it out on your data!

Note

Check out the image embeddings tutorial to see example uses of the Brain’s embeddings-powered visualization methods to uncover hidden structure in datasets.

Image embeddings example

The following example gives a taste of the powers of visual embeddings in FiftyOne using the BDD100K dataset from the dataset zoo, embeddings generated by a mobilenet model from the model zoo, and the default UMAP dimensionality reduction method.

In this setup, the scatterpoints in the Embeddings panel correspond to images in the validation split colored by the time of day labels provided by the BDD100K dataset. When points are lasso-ed in the plot, the corresponding samples are automatically selected in the Samples panel:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# The BDD dataset must be manually downloaded. See the zoo docs for details
source_dir = "/path/to/dir-with-bdd100k-files"

dataset = foz.load_zoo_dataset(
    "bdd100k", split="validation", source_dir=source_dir,
)

# Compute embeddings
# You will likely want to run this on a machine with GPU, as this requires
# running inference on 10,000 images
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")
embeddings = dataset.compute_embeddings(model)

# Compute visualization
results = fob.compute_visualization(
    dataset, embeddings=embeddings, seed=51, brain_key="img_viz"
)

session = fo.launch_app(dataset)

Note

Did you know? You can programmatically configure your Spaces layout!

[GIF: exploring image embeddings of the BDD100K validation split in the App]

The GIF shows the variety of insights that are revealed by running this simple protocol:

  • The first cluster of points selected reveals a set of samples whose field of view is corrupted by hardware gradients at the top and bottom of the image

  • The second cluster of points reveals a set of images in rainy conditions with water droplets on the windshield

  • Hiding the primary cluster of daytime points and selecting the remaining night points reveals that the night points have incorrect labels

Object embeddings example

The following example demonstrates how embeddings can be used to visualize the ground truth objects in the quickstart dataset using the compute_visualization() method’s default embeddings model and dimensionality reduction method.

In this setup, we generate a visualization for all ground truth objects, but then we create a view that restricts the visualization to only objects in a subset of the classes. The scatterpoints in the Embeddings panel correspond to objects, colored by their label. When points are lasso-ed in the plot, the corresponding object patches are automatically selected in the Samples panel:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Generate visualization for `ground_truth` objects
results = fob.compute_visualization(
    dataset, patches_field="ground_truth", brain_key="gt_viz"
)

# Restrict to the 10 most common classes
counts = dataset.count_values("ground_truth.detections.label")
classes = sorted(counts, key=counts.get, reverse=True)[:10]
view = dataset.filter_labels("ground_truth", F("label").is_in(classes))

session = fo.launch_app(view)

Note

Did you know? You can programmatically configure your Spaces layout!

[Image: exploring ground truth object embeddings in the App]

As you can see, the coloring of the scatterpoints allows you to discover natural clusters of objects, such as visually similar carrots or kites in the air.

Visualization API

This section describes how to set up, create, and manage visualizations in detail.

Changing your visualization method

You can use a specific dimensionality reduction method for a particular visualization run by passing the method parameter to compute_visualization():

index = fob.compute_visualization(..., method="<method>", ...)

Alternatively, you can change your default dimensionality reduction method for an entire session by setting the FIFTYONE_BRAIN_DEFAULT_VISUALIZATION_METHOD environment variable:

export FIFTYONE_BRAIN_DEFAULT_VISUALIZATION_METHOD=<method>

Finally, you can permanently change your default dimensionality reduction method by updating the default_visualization_method key of your brain config at ~/.fiftyone/brain_config.json:

{
    "default_visualization_method": "<method>",
    "visualization_methods": {
        "<method>": {...},
        ...
    }
}

Configuring your visualization method

Dimensionality reduction methods may be configured in a variety of method-specific ways, which you can see by inspecting the parameters of a method’s associated VisualizationConfig class.

The relevant classes for the builtin dimensionality reduction methods are:

  • umap: fiftyone.brain.visualization.UMAPVisualizationConfig

  • tsne: fiftyone.brain.visualization.TSNEVisualizationConfig

  • pca: fiftyone.brain.visualization.PCAVisualizationConfig

  • manual: fiftyone.brain.visualization.ManualVisualizationConfig

You can configure a dimensionality reduction method’s parameters for a specific run by simply passing supported config parameters as keyword arguments each time you call compute_visualization():

index = fob.compute_visualization(
    ...
    method="umap",
    min_dist=0.2,
)

Alternatively, you can more permanently configure your dimensionality reduction method(s) via your brain config.
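
For example, a brain config like the following sketch would set UMAP as the default method and customize its min_dist parameter (the value shown is illustrative):

{
    "default_visualization_method": "umap",
    "visualization_methods": {
        "umap": {"min_dist": 0.2}
    }
}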

Similarity

The FiftyOne Brain provides a compute_similarity() method that you can use to index the images or object patches in a dataset by similarity.

Once you’ve indexed a dataset by similarity, you can use the sort_by_similarity() view stage to programmatically sort your dataset by similarity to any image(s) or object patch(es) of your choice in your dataset. In addition, the App provides a convenient point-and-click interface for sorting by similarity with respect to an index on a dataset.

Note

Did you know? You can search by natural language using similarity indexes!

Embedding methods

Like embeddings visualization, similarity leverages deep embeddings to generate an index for a dataset.

The embeddings and model parameters of compute_similarity() support a variety of ways to generate embeddings for your data:

  • Provide nothing, in which case a default general purpose model is used to index your data

  • Provide a Model instance or the name of any model from the Model Zoo that supports embeddings

  • Provide your own precomputed embeddings in array form

  • Provide the name of a VectorField or ArrayField of your dataset in which precomputed embeddings are stored

Similarity backends

By default, all similarity indexes are served using a builtin scikit-learn backend, but you can pass the optional backend parameter to compute_similarity() to switch to another supported backend:

import fiftyone.brain as fob

results = fob.compute_similarity(
    dataset,
    backend="sklearn",  # "sklearn", "qdrant", "redis", etc
    brain_key="...",
    ...
)

Note

Refer to this section for more information about creating, managing and deleting similarity indexes.

Image similarity

This section demonstrates the basic workflow of:

  • Indexing an image dataset by similarity

  • Using the App’s image similarity UI to query by visual similarity

  • Using the SDK’s sort_by_similarity() view stage to programmatically query the index

To index a dataset by image similarity, pass the Dataset or DatasetView of interest to compute_similarity() along with a name for the index via the brain_key argument.

Next load the dataset in the App and select some image(s). Whenever there is an active selection in the App, a similarity icon will appear above the grid, enabling you to sort by similarity to your current selection.

You can use the advanced settings menu to choose between multiple brain keys and optionally specify a maximum number of matches to return (k) and whether to query by greatest or least similarity (if supported).

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Index images by similarity
fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    brain_key="img_sim",
)

session = fo.launch_app(dataset)

Note

In the example above, we specify a zoo model with which to generate embeddings, but you can also provide precomputed embeddings.

[Image: sorting by image similarity in the App]

Alternatively, you can use the sort_by_similarity() view stage to programmatically construct a view that contains the sorted results:

# Choose a random image from the dataset
query_id = dataset.take(1).first().id

# Programmatically construct a view containing the 15 most similar images
view = dataset.sort_by_similarity(query_id, k=15, brain_key="img_sim")

session.view = view

Note

Performing a similarity search on a DatasetView will only return results from the view; if the view contains samples that were not included in the index, they will never be included in the result.

This means that you can index an entire Dataset once and then perform searches on subsets of the dataset by constructing views that contain the images of interest.

Note

For large datasets, you may notice longer load times the first time you use a similarity index in a session. Subsequent similarity searches will use cached results and will be faster!

Object similarity

This section demonstrates the basic workflow of:

  • Indexing a dataset of objects by similarity

  • Using the App’s object similarity UI to query by visual similarity

  • Using the SDK’s sort_by_similarity() view stage to programmatically query the index

You can index any objects stored on datasets in Detection, Detections, Polyline, or Polylines format. See this section for more information about adding labels to your datasets.

To index by object patches, simply pass the Dataset or DatasetView of interest to compute_similarity() along with the name of the patches field and a name for the index via the brain_key argument.

Next load the dataset in the App and switch to object patches view by clicking the patches icon above the grid and choosing the label field of interest from the dropdown.

Now whenever you have selected one or more patches in the App, a similarity icon will appear above the grid, enabling you to sort by similarity to your current selection.

You can use the advanced settings menu to choose between multiple brain keys and optionally specify a maximum number of matches to return (k) and whether to query by greatest or least similarity (if supported).

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Index ground truth objects by similarity
fob.compute_similarity(
    dataset,
    patches_field="ground_truth",
    model="clip-vit-base32-torch",
    brain_key="gt_sim",
)

session = fo.launch_app(dataset)

Note

In the example above, we specify a zoo model with which to generate embeddings, but you can also provide precomputed embeddings.

[Image: sorting by object similarity in the App]

Alternatively, you can directly use the sort_by_similarity() view stage to programmatically construct a view that contains the sorted results:

# Convert to patches view
patches = dataset.to_patches("ground_truth")

# Choose a random patch object from the dataset
query_id = patches.take(1).first().id

# Programmatically construct a view containing the 15 most similar objects
view = patches.sort_by_similarity(query_id, k=15, brain_key="gt_sim")

session.view = view

Note

Performing a similarity search on a DatasetView will only return results from the view; if the view contains objects that were not included in the index, they will never be included in the result.

This means that you can index an entire Dataset once and then perform searches on subsets of the dataset by constructing views that contain the objects of interest.

Note

For large datasets, you may notice longer load times the first time you use a similarity index in a session. Subsequent similarity searches will use cached results and will be faster!

Text similarity

When you create a similarity index powered by the CLIP model, you can also search by arbitrary natural language queries natively in the App!

[Image: searching by natural language similarity in the App]

You can also perform text queries via the SDK by passing a prompt directly to sort_by_similarity() along with the brain_key of a compatible similarity index.
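
A minimal sketch, assuming the CLIP-powered img_sim index created in the image similarity example above (the prompt and k value are illustrative):

# Perform a text query against the index
query = "kites flying in the sky"
view = dataset.sort_by_similarity(query, k=15, brain_key="img_sim")

session.view = view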

Note

In general, any custom model that is made available via the model zoo interface that implements the PromptMixin interface can support text similarity queries!

Similarity API

This section describes how to set up, create, and manage similarity indexes in detail.

Changing your similarity backend

You can use a specific backend for a particular similarity index by passing the backend parameter to compute_similarity():

index = fob.compute_similarity(..., backend="<backend>", ...)

Alternatively, you can change your default similarity backend for an entire session by setting the FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND environment variable.

export FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND=<backend>

Finally, you can permanently change your default similarity backend by updating the default_similarity_backend key of your brain config at ~/.fiftyone/brain_config.json:

{
    "default_similarity_backend": "<backend>",
    "similarity_backends": {
        "<backend>": {...},
        ...
    }
}

Configuring your backend

Similarity backends may be configured in a variety of backend-specific ways, which you can see by inspecting the parameters of a backend’s associated SimilarityConfig class.

The relevant classes for the builtin similarity backends are:

  • sklearn: fiftyone.brain.internal.core.sklearn.SklearnSimilarityConfig

  • qdrant: fiftyone.brain.internal.core.qdrant.QdrantSimilarityConfig

  • redis: fiftyone.brain.internal.core.redis.RedisSimilarityConfig

  • pinecone: fiftyone.brain.internal.core.pinecone.PineconeSimilarityConfig

  • mongodb: fiftyone.brain.internal.core.mongodb.MongoDBSimilarityConfig

  • elasticsearch: fiftyone.brain.internal.core.elasticsearch.ElasticsearchSimilarityConfig

  • milvus: fiftyone.brain.internal.core.milvus.MilvusSimilarityConfig

  • lancedb: fiftyone.brain.internal.core.lancedb.LanceDBSimilarityConfig

You can configure a similarity backend’s parameters for a specific index by simply passing supported config parameters as keyword arguments each time you call compute_similarity():

index = fob.compute_similarity(
    ...
    backend="qdrant",
    url="http://localhost:6333",
)

Alternatively, you can more permanently configure your backend(s) via your brain config.

Creating an index

The compute_similarity() method provides a number of different syntaxes for initializing a similarity index. Let’s see some common patterns on the quickstart dataset:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

Default behavior

With no arguments, embeddings will be automatically computed for all images or patches in the dataset using a default model and added to a new index in your default backend.
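
A minimal sketch of this pattern (the brain key name is illustrative):

fob.compute_similarity(dataset, brain_key="default_sim")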

Custom model, custom backend, add embeddings later

With the syntax below, we’re specifying a similarity backend of our choice, specifying a custom model from the Model Zoo to use to generate embeddings, and using the embeddings=False syntax to create the index without initially adding any embeddings to it.
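
A sketch of this pattern (the backend and brain key names are illustrative, and the qdrant backend assumes a running Qdrant server):

similarity_index = fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    embeddings=False,  # create the index without computing embeddings
    backend="qdrant",
    brain_key="qdrant_sim",
)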

Precomputed embeddings

You can pass precomputed image or object embeddings to compute_similarity() via the embeddings argument.
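
For example, a sketch using embeddings precomputed with a zoo model (the model and brain key are illustrative):

model = foz.load_zoo_model("clip-vit-base32-torch")
embeddings = dataset.compute_embeddings(model)

fob.compute_similarity(
    dataset,
    embeddings=embeddings,
    brain_key="precomputed_sim",
)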

Adding embeddings to an index

You can use add_to_index() to add new embeddings or overwrite existing embeddings in an index at any time.
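
A sketch, assuming the similarity_index and model from the snippets above:

# Compute embeddings for the samples to add
embeddings = dataset.compute_embeddings(model)
sample_ids = dataset.values("id")

similarity_index.add_to_index(embeddings, sample_ids)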

Note

When using the default sklearn backend, you must manually call save() after adding or removing embeddings from an index in order to save the index to the database. This is not required when using external vector databases like Qdrant.

Note

Did you know? If you provided the name of a zoo model when creating the similarity index, you can use get_model() to load the model later. Or, you can use compute_embeddings() to conveniently generate embeddings for new samples/objects using the index’s model.

Retrieving embeddings in an index

You can use get_embeddings() to retrieve the embeddings for any or all IDs of interest from an existing index.
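
A sketch, assuming the similarity_index from above:

# Retrieve the embeddings for some samples of interest
ids = dataset.take(50).values("id")
embeddings, sample_ids, _ = similarity_index.get_embeddings(sample_ids=ids)
print(embeddings.shape)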

Removing embeddings from an index

You can use remove_from_index() to delete embeddings from an index by their ID.
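
A sketch, assuming the similarity_index from above:

ids = dataset.take(10).values("id")
similarity_index.remove_from_index(sample_ids=ids)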

Note

When using the default sklearn backend, you must manually call save() after adding or removing embeddings from an index in order to save the index to the database.

This is not required when using external vector databases like Qdrant.

Deleting an index

When working with backends like Qdrant that leverage external vector databases, you can call cleanup() to delete the external index/collection.
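
A sketch, assuming the similarity_index from above was created with an external backend (the brain key is illustrative):

similarity_index.cleanup()  # deletes the external index/collection

# Also delete the brain run record if desired
dataset.delete_brain_run("qdrant_sim")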

Note

Calling cleanup() has no effect when working with the default sklearn backend. The index is deleted only when you call delete_brain_run().

Applications

How can similarity be used in practice? A common pattern is to mine your dataset for similar examples to certain images or object patches of interest, e.g., those that represent failure modes of a model that need to be studied in more detail or underrepresented classes that need more training examples.

Here are a few of the many possible applications:

  • Pruning near-duplicate images from your training dataset

  • Identifying failure patterns of a model

  • Finding examples of target scenarios in your data lake

  • Mining hard examples for your evaluation pipeline

  • Recommending samples from your data lake for classes that need additional training data

Leaky splits

Despite our best efforts, duplicates and other forms of non-IID samples show up in our data. When these samples end up in different splits, this can have consequences when evaluating a model, often making it easy to overestimate model capability. The FiftyOne Brain offers a way to identify such cases in dataset splits.

The leaks of a dataset can be computed directly without the need for the predictions of a pre-trained model via the compute_leaky_splits() method:

import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset(...)

# Splits defined via tags
split_tags = ["train", "test"]
index = fob.compute_leaky_splits(dataset, splits=split_tags)
leaks = index.leaks_view()

# Splits defined via field
split_field = "split"  # holds split values e.g. 'train' or 'test'
index = fob.compute_leaky_splits(dataset, splits=split_field)
leaks = index.leaks_view()

# Splits defined via views
split_views = {"train": train_view, "test": test_view}
index = fob.compute_leaky_splits(dataset, splits=split_views)
leaks = index.leaks_view()

Notice how the splits of the dataset can be defined in three ways: through sample tags, through a string field that assigns each split a unique value in the field, or by directly providing views that define the splits.

Input: A Dataset or DatasetView, and a definition of splits through one of tags, a field, or views.

Output: An index that will allow you to look through your leaks with leaks_view() and also provides some useful actions once they are discovered, such as automatically cleaning the dataset with no_leaks_view() or tagging the leaks for future action with tag_leaks().

What to expect: Leaky splits works by embedding samples with a powerful model and finding very close samples in different splits in this space. Large, powerful models that were not trained on a dataset can provide insight into visual and semantic similarity between images, without creating further leaks in the process.

Similarity index: Under the hood, leaky splits leverages the brain’s SimilarityIndex to detect leaks. Any similarity backend that implements the DuplicatesMixin can be used to compute leaky splits. You can either provide an existing similarity index by passing its brain key via the similarity_index argument, or have the method create one on the fly for you.

Embeddings: You can customize the model used to compute embeddings via the model argument of compute_leaky_splits(). You can also precompute embeddings and tell leaky splits to use them by passing them via the embeddings argument.

Thresholds: Leaky splits uses a threshold to decide what samples are too close and thus mark them as potential leaks. This threshold can be customized by passing a value to the threshold argument of compute_leaky_splits(). The best value for your use case may vary depending on your dataset, as well as the embeddings used. A threshold that’s too big may produce many false positives, while a threshold that’s too small may produce many false negatives.

The example code below runs leaky splits analysis on the COCO dataset. Try it for yourself and see what you find!

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
import fiftyone.utils.random as four

# Load some COCO data
dataset = foz.load_zoo_dataset("coco-2017", split="test")

# Set up splits via tags
dataset.untag_samples(dataset.distinct("tags"))
four.random_split(dataset, {"train": 0.7, "test": 0.3})

# Find leaks
index = fob.compute_leaky_splits(dataset, splits=["train", "test"])
leaks = index.leaks_view()

The leaks_view() method returns a view that contains only the leaks in the input splits. Once you have these leaks, it is wise to look through them. You may gain some insight into the source of the leaks:

session = fo.launch_app(leaks)

Before evaluating your model on your test set, consider getting a version of it with the leaks removed. This can be easily done via no_leaks_view():

# The original test split
test_set = index.split_views["test"]

# The test set with leaks removed
test_set_no_leaks = index.no_leaks_view(test_set)

session.view = test_set_no_leaks

Performance on the clean test set can be closer to the performance of the model in the wild. If you found some leaks in your dataset, consider comparing performance on the base test set against the clean test set.

[Image: leaky splits analysis in the App]

Near duplicates

When curating massive datasets, you may inadvertently add near duplicate data to your datasets, which can bias or otherwise confuse your models.

The compute_near_duplicates() method leverages embeddings to automatically surface near-duplicate samples in your dataset:

import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset(...)

index = fob.compute_near_duplicates(dataset)
print(index.duplicate_ids)

dups_view = index.duplicates_view()
session = fo.launch_app(dups_view)

Input: An unlabeled (or labeled) dataset. There are recipes for building datasets from a wide variety of image formats, ranging from a simple directory of images to complicated dataset structures like COCO.

Output: A SimilarityIndex object that provides powerful methods such as duplicate_ids, neighbors_map and duplicates_view() to analyze potential near duplicates, as demonstrated below.

What to expect: Near duplicates analysis leverages embeddings to identify samples that are too close to their nearest neighbors. You can provide pre-computed embeddings, specify a zoo model of your choice to use to compute embeddings, or provide nothing and rely on the method’s default model to generate embeddings.

Thresholds: When using custom embeddings/models, you may need to adjust the distance threshold used to detect potential duplicates. You can do this by passing a value to the threshold argument of compute_near_duplicates(). The best value for your use case may vary depending on your dataset, as well as the embeddings used. A threshold that’s too big may have a lot of false positives, while a threshold that’s too small may have a lot of false negatives.

The following example demonstrates how to use compute_near_duplicates() to detect near duplicate images on the CIFAR-10 dataset:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10", split="test")

To proceed, we first need some suitable image embeddings for the dataset. Although the compute_near_duplicates() method is equipped with a default general-purpose model to generate embeddings if none are provided, you’ll typically find higher-quality insights when a domain-specific model is used to generate embeddings.

In this case, we’ll use a classifier that has been fine-tuned on CIFAR-10 to pre-compute embeddings and then feed them to compute_near_duplicates():

import fiftyone.brain as fob
import fiftyone.brain.internal.models as fbm

# Compute embeddings via a pre-trained CIFAR-10 classifier
model = fbm.load_model("simple-resnet-cifar10")
embeddings = dataset.compute_embeddings(model, batch_size=16)

# Scan for near-duplicates
index = fob.compute_near_duplicates(
    dataset,
    embeddings=embeddings,
    thresh=0.02,
)

Finding near-duplicate samples

The neighbors_map property of the index provides a data structure that summarizes the findings. The keys of the dictionary are the sample IDs of each non-duplicate sample, and the values are lists of (id, distance) tuples listing the sample IDs of the duplicate samples for each reference sample together with the embedding distance between the two samples:

print(index.neighbors_map)
{
    '61143408db40df926c571a6b': [
        ('61143409db40df926c573075', 5.667297674385298),
        ('61143408db40df926c572ab6', 6.231051661334058)
    ],
    '6114340cdb40df926c577f2a': [
        ('61143408db40df926c572b54', 6.042934361555487)
    ],
    '61143408db40df926c572aa3': [
        ('6114340bdb40df926c5772e9', 5.88984758067434),
        ('61143408db40df926c572b64', 6.063986454046798),
        ('61143409db40df926c574571', 6.10303338363576),
        ('6114340adb40df926c5749a2', 6.161749290179865)
    ],
    ...
}

We can conveniently visualize this information in the App via the duplicates_view() method of the index, which constructs a view with the duplicate samples arranged directly after their corresponding reference sample, with optional additional fields recording the type and nearest reference sample ID/distance:

duplicates_view = index.duplicates_view(
    type_field="dup_type",
    id_field="dup_id",
    dist_field="dup_dist",
)

session = fo.launch_app(duplicates_view)

[Image: CIFAR-10 near-duplicates view in the App]

Note

You can also use the find_duplicates() method of the index to rerun the duplicate detection with a different threshold without calling compute_near_duplicates() again.
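
For example, a sketch with an illustrative threshold:

index.find_duplicates(thresh=0.05)
print(index.duplicate_ids)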

Finding maximally unique samples

You can also use the find_unique() method of the index to identify a set of samples of any desired size that are maximally unique with respect to each other:

# Use the similarity index to identify 500 maximally unique samples
index.find_unique(500)
print(index.unique_ids[:5])

We can also conveniently visualize the results of this operation via the visualize_unique() method of the index, which generates a scatterplot with the unique samples colored separately:

# Generate a 2D visualization
viz_results = fob.compute_visualization(dataset, embeddings=embeddings)

# Visualize the unique samples in embeddings space
plot = index.visualize_unique(viz_results)
plot.show(height=800, yaxis_scaleanchor="x")

[Image: scatterplot of maximally unique CIFAR-10 samples]

And of course we can load a view containing the unique samples in the App to explore the results in detail:

# Visualize the unique images in the App
unique_view = dataset.select(index.unique_ids)
session = fo.launch_app(view=unique_view)

[Image: maximally unique CIFAR-10 samples in the App]

Exact duplicates

Despite your best efforts, you may accidentally add duplicate data to a dataset. Left unmitigated, such quality issues can bias your models and confound your analysis.

The compute_exact_duplicates() method scans your dataset and determines if you have duplicate data either under the same or different filenames:

import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset(...)

duplicates_map = fob.compute_exact_duplicates(dataset)
print(duplicates_map)

Input: An unlabeled (or labeled) dataset. There are recipes for building datasets from a wide variety of image formats, ranging from a simple directory of images to complicated dataset structures like COCO.

Output: A dictionary mapping IDs of samples with exact duplicates to lists of IDs of the duplicates for the corresponding sample

What to expect: Exact duplicates analysis uses file hashes to identify duplicate data, regardless of whether they are stored under the same or different filepaths in your dataset.

Image uniqueness

The FiftyOne Brain allows for the computation of the uniqueness of an image, in comparison with other images in a dataset; it does so without requiring any model from you. One good use of uniqueness is in the early stages of the machine learning workflow when you are deciding what subset of data with which to bootstrap your models. Unique samples are vital in creating training batches that help your model learn as efficiently and effectively as possible.

The uniqueness of a Dataset can be computed directly without the need for the predictions of a pre-trained model via the compute_uniqueness() method:

import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset(...)

fob.compute_uniqueness(dataset)

Input: An unlabeled (or labeled) image dataset. There are recipes for building datasets from a wide variety of image formats, ranging from a simple directory of images to complicated dataset structures like COCO.

Note

Did you know? Instead of using FiftyOne’s default model to generate embeddings, you can provide your own embeddings or specify a model from the Model Zoo to use to generate embeddings via the optional embeddings and model arguments to compute_uniqueness().

Output: A scalar-valued uniqueness field is populated on each sample that ranks the uniqueness of that sample (higher value means more unique). The uniqueness values for a dataset are normalized to [0, 1], with the most unique sample in the collection having a uniqueness value of 1.

You can customize the name of this field by passing the optional uniqueness_field argument to compute_uniqueness().
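
For example, a minimal sketch (the field name is illustrative):

fob.compute_uniqueness(dataset, uniqueness_field="sample_uniqueness")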

What to expect: Uniqueness uses a tuned algorithm that measures the distribution of each Sample in the Dataset. Using this distribution, it ranks each sample based on its relative similarity to other samples. Those that are close to other samples are not unique whereas those that are far from most other samples are more unique.

Note

Did you know? You can specify a region of interest within each image to use to compute uniqueness by providing the optional roi_field argument to compute_uniqueness(), which contains Detections or Polylines that define the ROI for each sample.

Note

Check out the uniqueness tutorial to see an example use case of the Brain’s uniqueness method to detect near-duplicate images in a dataset.

[Image: uniqueness results in the App]

Label mistakes

Label mistakes can be calculated for both classification and detection datasets.

[Image: mistakenness results in the App]

Sample hardness

During training, it is useful to identify samples that are more difficult for a model to learn so that training can be more focused around these hard samples. These hard samples are also useful as seeds when considering what other new samples to add to a training dataset.

In order to compute hardness, all you need to do is add your model predictions and their logits to your FiftyOne Dataset and then run the compute_hardness() method:

import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset(...)

fob.compute_hardness(dataset, "predictions")

Input: A Dataset or DatasetView on which predictions have been computed and are stored in the "predictions" field. Ground truth annotations are not required for hardness.

Output: A scalar-valued hardness field is populated on each sample that ranks the hardness of the sample. You can customize the name of this field via the hardness_field argument of compute_hardness().

What to expect: Hardness is computed in the context of a prediction model. The FiftyOne Brain hardness measure defines hard samples as those for which the prediction model is unsure about what label to assign. This measure incorporates prediction confidence and logits in a tuned model that has demonstrated empirical value in many model training exercises.

Note

Check out the classification evaluation tutorial to see example uses of the Brain’s hardness method to uncover annotation mistakes in a dataset.

[Image: hardness results in the App]

Image representativeness

During the early stages of the ML workflow it can be useful to find prototypical samples in your data that accurately describe all the different aspects of your data. The FiftyOne Brain provides a representativeness method that finds samples which are very similar to large clusters of your data. Highly representative samples are great for finding modes or easy examples in your dataset.

The representativeness of a Dataset can be computed directly without the need for the predictions of a pre-trained model via the compute_representativeness() method:

import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset(...)

fob.compute_representativeness(dataset)

Input: An unlabeled (or labeled) image dataset. There are recipes for building datasets from a wide variety of image formats, ranging from a simple directory of images to complicated dataset structures like COCO.

Output: A scalar-valued representativeness field is populated for each sample that ranks the representativeness of that sample (higher value means more representative). The representativeness values for a dataset are normalized to [0, 1], with the most representative samples in the collection having a representativeness value of 1.

You can customize the name of this field by passing the optional representativeness_field argument to compute_representativeness().

What to expect: Representativeness uses a clustering algorithm to find similar looking groups of samples. The representativeness is then computed based on each sample’s proximity to the computed cluster centers, farther samples being less representative and closer samples being more representative.

Note

Did you know? You can specify a region of interest within each image to use to compute representativeness by providing the optional roi_field argument to compute_representativeness(), which contains Detections or Polylines that define the ROI for each sample.

[Image: representativeness results in the App]

Managing brain runs

When you run a brain method with a brain_key argument, the run is recorded on the dataset and you can retrieve information about it later, rename it, delete it (along with any modifications to your dataset that were performed by it), and even retrieve the view that you computed on via methods on your dataset such as list_brain_runs(), get_brain_info(), load_brain_results(), load_brain_view(), rename_brain_run(), and delete_brain_run().

The example below demonstrates the basic interface:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

view = dataset.take(100)

# Run a brain method that returns results
results = fob.compute_visualization(view, brain_key="visualization")

# Run a brain method that populates a new sample field on the dataset
fob.compute_uniqueness(view)

# List the brain methods that have been run
print(dataset.list_brain_runs())
# ['visualization', 'uniqueness']

# Print information about a brain run
print(dataset.get_brain_info("visualization"))

# Load the results of a previous brain run
also_results = dataset.load_brain_results("visualization")

# Load the view on which a brain run was performed
same_view = dataset.load_brain_view("visualization")

# Rename a brain run
dataset.rename_brain_run("visualization", "still_visualization")

# Delete brain runs
# This will delete any stored results and fields that were populated
dataset.delete_brain_run("still_visualization")
dataset.delete_brain_run("uniqueness")

Brain config

FiftyOne provides a brain config that you can use to either temporarily or permanently configure the behavior of brain methods.

Viewing your config

You can print your current brain config at any time via the Python library and the CLI.
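
A minimal sketch of both approaches, in Python:

import fiftyone.brain as fob

# Print your current brain config
print(fob.brain_config)

And via the CLI (assuming the fiftyone CLI is available in your environment):

fiftyone brain config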

Note

If you have customized your brain config via any of the methods described below, printing your config is a convenient way to ensure that the changes you made have taken effect as you expected.

Modifying your config

You can modify your brain config in a variety of ways. The following sections describe these options in detail.

Order of precedence

The following order of precedence is used to assign values to your brain config settings at runtime:

  1. Config settings applied at runtime by directly editing fiftyone.brain.brain_config

  2. FIFTYONE_BRAIN_XXX environment variables

  3. Settings in your JSON config (~/.fiftyone/brain_config.json)

  4. The default config values

Editing your JSON config

You can permanently customize your brain config by creating a ~/.fiftyone/brain_config.json file on your machine. The JSON file may contain any desired subset of config fields that you wish to customize.

For example, the following config JSON file customizes the URL of your Qdrant server without changing any other default config settings:

{
    "similarity_backends": {
        "qdrant": {
            "url": "http://localhost:8080"
        }
    }
}

When fiftyone.brain is imported, any options from your JSON config are merged into the default config, as per the order of precedence described above.

Note

You can customize the location from which your JSON config is read by setting the FIFTYONE_BRAIN_CONFIG_PATH environment variable.

Setting environment variables

Brain config settings may be customized on a per-session basis by setting the FIFTYONE_BRAIN_XXX environment variable(s) for the desired config settings.

The FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND environment variable allows you to configure your default similarity backend:

export FIFTYONE_BRAIN_DEFAULT_SIMILARITY_BACKEND=qdrant

Similarity backends

You can declare parameters for specific similarity backends by setting environment variables of the form FIFTYONE_BRAIN_SIMILARITY_<BACKEND>_<PARAMETER>. Any settings that you declare in this way will be passed as keyword arguments to methods like compute_similarity() whenever the corresponding backend is in use. For example, you can configure the URL of your Qdrant server as follows:

export FIFTYONE_BRAIN_SIMILARITY_QDRANT_URL=http://localhost:8080

The FIFTYONE_BRAIN_SIMILARITY_BACKENDS environment variable can be set to a comma-separated list of backends that you want to expose in your session, which may exclude native backends and/or declare additional custom backends whose parameters are defined via additional config modifications of any kind:

export FIFTYONE_BRAIN_SIMILARITY_BACKENDS=custom,sklearn,qdrant

When declaring new backends, you can include * to append new backend(s) without omitting or explicitly enumerating the builtin backends. For example, you can add a custom similarity backend as follows:

export FIFTYONE_BRAIN_SIMILARITY_BACKENDS=*,custom
export FIFTYONE_BRAIN_SIMILARITY_CUSTOM_CONFIG_CLS=your.custom.SimilarityConfig

Visualization methods

You can declare parameters for specific visualization methods by setting environment variables of the form FIFTYONE_BRAIN_VISUALIZATION_<METHOD>_<PARAMETER>. Any settings that you declare in this way will be passed as keyword arguments to methods like compute_visualization() whenever the corresponding method is in use. For example, you can suppress logging messages for the UMAP method as follows:

export FIFTYONE_BRAIN_VISUALIZATION_UMAP_VERBOSE=false

The FIFTYONE_BRAIN_VISUALIZATION_METHODS environment variable can be set to a comma-separated list of methods that you want to expose in your session, which may exclude native methods and/or declare additional custom methods whose parameters are defined via additional config modifications of any kind:

export FIFTYONE_BRAIN_VISUALIZATION_METHODS=custom,umap,tsne

When declaring new methods, you can include * to append new method(s) without omitting or explicitly enumerating the builtin methods. For example, you can add a custom visualization method as follows:

export FIFTYONE_BRAIN_VISUALIZATION_METHODS=*,custom
export FIFTYONE_BRAIN_VISUALIZATION_CUSTOM_CONFIG_CLS=your.custom.VisualizationConfig

Modifying your config in code

You can dynamically modify your brain config at runtime by directly editing the fiftyone.brain.brain_config object.

Any changes to your brain config applied in this manner will immediately take effect in all subsequent calls to fiftyone.brain.brain_config during your current session.

import fiftyone.brain as fob

fob.brain_config.default_similarity_backend = "qdrant"
fob.brain_config.default_visualization_method = "tsne"