Who this is for¶
This tutorial is designed for:
FiftyOne Experience: Beginners who have basic familiarity with FiftyOne's core concepts like Datasets and Samples
Expertise Level: Machine learning practitioners with basic understanding of computer vision and classification tasks
Goals: Users looking to implement classification for new or changing categories without retraining models, or those wanting to quickly label datasets with flexible categories
Assumed Knowledge¶
Computer Vision Concepts¶
- Basic understanding of image classification
- Familiarity with model inference and confidence scores
- Understanding of zero-shot learning (helpful, but not required)
Technical Prerequisites¶
- Python programming fundamentals
- Basic understanding of PyTorch
- Experience working with Jupyter notebooks
FiftyOne Concepts¶
You should be familiar with FiftyOne's core concepts, such as Datasets, Samples, and label fields.
Time to Complete¶
- Approximately 30-45 minutes
Required Packages¶
It's recommended to use a virtual environment with FiftyOne already installed. You'll need these additional packages:
# Install required packages
pip install fiftyone
pip install torch torchvision
pip install open_clip_torch
pip install transformers
Content Overview¶
The notebook covers:
Dataset Download: Loading the ImageNet-O dataset from Hugging Face
FiftyOne Model Zoo: Using CLIP models from FiftyOne's built-in model zoo for zero-shot classification
OpenCLIP Integration: Implementing zero-shot classification using OpenCLIP models with various architectures
Hugging Face Integration: Running zero-shot classification using models from [Hugging Face's model hub](https://beta-docs.voxel51.com/integrations/huggingface/)
Zero-Shot Classification in FiftyOne¶
Traditionally, computer vision models are trained to predict a fixed set of categories. For image classification, for instance, many standard models are trained on the ImageNet dataset, which contains 1,000 categories. All images must be assigned to one of these 1,000 categories, and the model is trained to predict the correct category for each image.
Thanks to recent advances in multimodal models, it is now possible to perform zero-shot learning, which allows us to predict categories that were not seen during training. This can be especially useful when:
• We want to roughly pre-label images with a new set of categories
• Obtaining labeled data for all categories is impractical or impossible
• The categories change over time, and we want to predict new categories without retraining the model
Download Dataset¶
In this tutorial we will use the ImageNet-O dataset.
The ImageNet-O dataset consists of images from classes not found in the standard ImageNet-1k dataset. It tests the robustness and out-of-distribution detection capabilities of computer vision models trained on ImageNet-1k.
Let's load the dataset from Voxel51's Hugging Face Org:
import fiftyone as fo
import fiftyone.utils.huggingface as fouh
dataset = fouh.load_from_hub("Voxel51/ImageNet-O")
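If you'd like to browse the samples visually before running any models, you can launch the FiftyOne App. This is an optional step; the sketch below assumes you're working locally or in a notebook:
# Launch the FiftyOne App to explore the dataset (optional)
session = fo.launch_app(dataset)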
Let's grab the classes from this dataset using the distinct() method of the Dataset. These will be the classes we use for zero-shot classification:
dataset_classes = dataset.distinct("ground_truth.label")
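As a quick sanity check, you can inspect how many classes you're working with and what a few of them look like:
# distinct() returns a sorted list of unique label values
print(len(dataset_classes))
print(dataset_classes[:5])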
FiftyOne Model Zoo¶
The FiftyOne Model Zoo provides a powerful interface for downloading models and applying them to your FiftyOne datasets. It provides native access to hundreds of pre-trained models, and it also supports downloading arbitrary public or private models whose definitions are provided via GitHub repositories or URLs.
All of these models accept a text_prompt
keyword argument, which allows you to override the prompt template used to embed the class names. Zero-shot classification results can vary based on this text!
You can load a model from the Model Zoo using the load_zoo_model method. In this example, we will use CLIP:
import torch
import fiftyone.zoo as foz
clip_zoo_model = foz.load_zoo_model(
name_or_url="clip-vit-base32-torch",
text_prompt="A photo of a ",
classes=dataset_classes,
device="cuda" if torch.cuda.is_available() else "cpu",
# install_requirements=True # uncomment this line if you are running this code for the first time
)
dataset.apply_model(clip_zoo_model, label_field="clip_classification")
As a refresher, dataset.apply_model()
uses the model for inference on the dataset, with the parameters provided when we loaded the zoo model. It creates a dataset label field named "clip_classification" and populates each sample with the output.
You can examine the results on the first Sample as follows:
dataset.first()['clip_classification']
<Classification: { 'id': '67d9ddda99e7fb132baf9334', 'tags': [], 'label': 'mousetrap', 'confidence': 0.34010758996009827, 'logits': None, }>
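Because the predictions are stored as standard Classification labels, you can immediately query the dataset with them. As a minimal sketch (the 0.2 threshold here is arbitrary), you could surface the samples where CLIP was least confident:
from fiftyone import ViewField as F

# View containing samples with low-confidence CLIP predictions
low_conf_view = dataset.match(F("clip_classification.confidence") < 0.2)
print(low_conf_view.count())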
Open CLIP Integration¶
FiftyOne integrates natively with the OpenCLIP library, an open source implementation of OpenAI’s CLIP (Contrastive Language-Image Pre-training) model that you can use to run inference on your FiftyOne datasets with a few lines of code!
To get started with OpenCLIP, install the open_clip_torch
package:
!pip install open_clip_torch
When running inference with OpenCLIP, you can specify a text prompt to help guide the model, and you can restrict the output to a specific set of classes for zero-shot classification.
import torch
import fiftyone.zoo as foz
open_clip_model = foz.load_zoo_model(
name_or_url="open-clip-torch",
text_prompt="A photo of a",
classes=dataset_classes,
device="cuda" if torch.cuda.is_available() else "cpu",
# install_requirements=True # uncomment this line if you are running this code for the first time
)
dataset.apply_model(open_clip_model, label_field="open_clip_classification")
dataset.first()['open_clip_classification']
<Classification: { 'id': '67d9de8e99e7fb132bafa2d4', 'tags': [], 'label': 'mousetrap', 'confidence': nan, 'logits': None, }>
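With two prediction fields on the dataset, you can already compare models against each other. For example, here is a quick sketch that finds the samples where the two CLIP variants disagree:
from fiftyone import ViewField as F

# Samples where the zoo CLIP model and OpenCLIP predict different labels
disagreements = dataset.match(
    F("clip_classification.label") != F("open_clip_classification.label")
)
print(disagreements.count())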
You can also specify different model architectures and pretrained weights by passing in optional parameters. Pretrained models can be loaded directly from OpenCLIP with the following syntax:
meta_clip = foz.load_zoo_model(
name_or_url="open-clip-torch",
clip_model="ViT-B-32-quickgelu",
pretrained="metaclip_400m",
text_prompt="A photo of a",
classes=dataset_classes,
)
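You would then apply this model to the dataset exactly as before; the label field name below is just illustrative:
dataset.apply_model(meta_clip, label_field="metaclip_classification")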
Alternatively, you can load a model from Hugging Face's Model Hub with the following syntax:
import fiftyone.zoo as foz
open_clip_model = foz.load_zoo_model(
name_or_url="open-clip-torch",
clip_model="hf-hub:repo-name/model-name",
pretrained="",
)
As a concrete example, if you were interested in the StreetCLIP model you would use:
street_clip_model = foz.load_zoo_model(
name_or_url="open-clip-torch",
pretrained="",
clip_model="hf-hub:geolocal/StreetCLIP"
)
Hugging Face Integration¶
You can also run models from Hugging Face as a Zoo Model with FiftyOne's Hugging Face Integration. Note: these models must be fully integrated into the Hugging Face transformers library; some model weights are available via Hugging Face but are not fully integrated into the transformers library.
To load a model from the Hugging Face Hub, set name_or_url="zero-shot-classification-transformer-torch". This specifies that you want to load a zero-shot image classification model from the Hugging Face Transformers library. You can then specify the model via the name_or_path argument, which should be the repository name or model identifier of the model you want to load:
import torch
import fiftyone.zoo as foz
siglip_model = foz.load_zoo_model(
name_or_url="zero-shot-classification-transformer-torch",
name_or_path="google/siglip2-so400m-patch14-384",
classes=dataset_classes,
device="cuda" if torch.cuda.is_available() else "cpu",
# install_requirements=True # uncomment this line if you are running this code for the first time
)
dataset.apply_model(siglip_model, label_field="siglip2_classification")
We can examine the output for the first Sample as shown below. Note that not all models will output a value for confidence
or logits
.
dataset.first()['siglip2_classification']
<Classification: { 'id': '67d9dff499e7fb132bafb274', 'tags': [], 'label': 'frying pan', 'confidence': 0.10734604299068451, 'logits': array([-14.63465 , -16.027546 , -14.841102 , -15.603795 , -15.634769 , -26.755354 , -28.286345 , -28.364231 , -27.1351 , -26.7131 , -27.507458 , -26.499924 , -21.994976 , -13.691553 , -28.55092 , -25.564491 , -14.591154 , -26.281277 , -26.274792 , -13.6278 , -14.481212 , -24.684128 , -24.921196 , -28.01366 , -24.428066 , -26.221783 , -24.934784 , -27.947395 , -28.367542 , -13.191751 , -26.12117 , -27.978542 , -21.961689 , -13.579732 , -26.58063 , -13.924773 , -27.960869 , -13.0043545, -26.122686 , -20.667011 , -22.799128 , -20.430523 , -15.019728 , -27.692225 , -27.954025 , -12.913255 , -27.146004 , -27.823196 , -28.423857 , -23.00042 , -23.592287 , -25.413513 , -26.26255 , -15.5457735, -25.749825 , -24.848305 , -28.531633 , -26.165901 , -23.151388 , -19.266499 , -24.093252 , -14.020183 , -26.386292 , -28.239845 , -22.07525 , -27.582024 , -25.223106 , -19.29158 , -12.56849 , -25.734182 , -22.642479 , -24.389975 , -24.530643 , -23.746538 , -27.28984 , -25.155281 , -25.74003 , -19.188692 , -24.167593 , -24.387491 , -26.841904 , -23.65986 , -13.781616 , -26.865953 , -28.562445 , -14.650488 , -27.812248 , -23.470142 , -28.144627 , -28.123253 , -30.08646 , -24.829483 , -15.422783 , -26.38512 , -22.980677 , -12.935882 , -24.881578 , -26.172321 , -25.142155 , -27.739046 , -28.699215 , -28.724304 , -28.418411 , -27.454609 , -22.694475 , -25.072674 , -28.214672 , -14.173061 , -13.863817 , -19.484104 , -28.667864 , -25.375149 , -26.225508 , -24.78184 , -26.520378 , -23.774696 , -27.95382 , -28.486088 , -21.505005 , -22.355341 , -23.349148 , -24.033804 , -23.604559 , -13.987801 , -13.772884 , -28.77172 , -23.77694 , -27.695286 , -13.899179 , -24.19553 , -28.618416 , -23.60842 , -16.364819 , -13.776959 , -13.612553 , -28.31878 , -20.912731 , -24.29652 , -25.350151 , -28.747776 , -14.814828 , -13.303535 , -28.47046 , -24.823593 , -28.176216 , -22.470306 , -20.222847 , -28.176281 , -29.118187 , -20.819706 , -23.281315 , -28.769344 , -27.104895 , -27.658867 , -25.080599 , -22.988195 , -24.258585 , -20.518099 , -28.228722 , -22.119791 , -23.093803 , -25.818127 , -19.867298 , -16.562128 , -22.246336 , -25.318657 , -24.268326 , -26.903309 , -26.76782 , -24.200016 , -28.868828 , -26.636438 , -24.618382 , -18.01135 , -22.788385 , -29.595188 , -29.231174 , -26.795033 , -22.59452 , -14.442874 , -14.756289 , -25.36583 , -20.68103 , -27.915737 , -19.589752 , -27.01284 , -24.279789 , -29.063992 , -27.855495 , -15.463411 , -25.507193 , -24.51913 , -22.107254 , -22.242912 , -23.621586 , -15.402868 , -24.746723 , -23.633163 , -28.656672 , -23.97926 ], dtype=float32), }>
Any model that can be run in a Hugging Face pipeline for the zero-shot-image-classification
task can be loaded as a Zoo model.
A good first step is simply to try it: pass the model name into the name_or_path argument of the load_zoo_model method. If a Hugging Face model is not compatible with the integration, you'll see an error to the effect of:
ValueError: Unrecognized model in <whatever-model-name>
In this case, you will need to run the model manually. All this means is that you need to instantiate the model and its processor, then write some logic to parse the model output into a FiftyOne Classification.
Refer to this documentation for details on how to manually parse model outputs as a FiftyOne Classification.
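As a rough sketch of the pattern (not the exact code from that documentation), the idea is to run inference with whatever code the model's authors provide and wrap each prediction in a fo.Classification. Here, run_my_model is a hypothetical helper standing in for your model and processor logic:
import fiftyone as fo

for sample in dataset.iter_samples(autosave=True, progress=True):
    # run_my_model() is a hypothetical stand-in for your model + processor;
    # it should return a predicted label and a score for the given image
    label, score = run_my_model(sample.filepath)
    sample["manual_classification"] = fo.Classification(
        label=label,
        confidence=score,
    )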
Conclusion¶
In this tutorial, you've learned how to:
- Implement zero-shot classification using multiple model architectures without needing to retrain models
- Use three different approaches for zero-shot classification:
- FiftyOne's Model Zoo CLIP models
- OpenCLIP models with custom architectures
- Hugging Face Transformers integration
- Customize text prompts to improve classification results
- Apply models to a FiftyOne dataset and access the classification results
Key Takeaways¶
- Zero-shot classification enables prediction of new categories without model retraining
- Different model architectures and text prompts can significantly impact results
- FiftyOne provides flexible integrations with popular model frameworks
- Classification results are stored as standard FiftyOne Classifications, making them easy to analyze and evaluate
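For example, because the zero-shot predictions above live in ordinary label fields, you can evaluate them against the ground truth with FiftyOne's built-in classification evaluation. Here is a minimal sketch using the field names from this tutorial:
# Compare CLIP's zero-shot predictions to the ground truth labels
results = dataset.evaluate_classifications(
    "clip_classification",
    gt_field="ground_truth",
    eval_key="eval_clip",
)
results.print_report()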
Next Steps¶
Check out this in-depth, end-to-end tutorial for Zero-Shot Classification, which includes details on how to evaluate your results.
Learn how to evaluate classification results
You might also be interested in reading these blogs:
For more resources and updates, follow us on LinkedIn or join our Discord Community.