Who this is for¶
This tutorial is designed for machine learning practitioners who:
- Have basic familiarity with FiftyOne (used it at least once before)
- Are interested in exploring zero-shot object detection without training models
- Want to quickly test different zero-shot detection models on their datasets
Assumed Knowledge¶
Computer Vision Concepts:
- Understanding of object detection and bounding boxes
- Familiarity with confidence scores and model predictions
- Basic knowledge of zero-shot learning concepts
Technical Requirements:
- Intermediate Python programming skills
- Experience with PyTorch and/or Hugging Face
- Ability to work with image datasets and common formats (jpg, png)
FiftyOne Concepts: You should be familiar with basic FiftyOne concepts such as Datasets, Samples, and label fields (for example, Detections)
Time to complete¶
Estimated time: 30-45 minutes
- Setup: 5-10 minutes
- Tutorial: 20-25 minutes
- Experimentation: 10+ minutes
Required packages¶
Make sure you have a virtual environment with FiftyOne already installed. Then install the following packages:
# Install required packages
pip install fiftyone
pip install torch torchvision
pip install "transformers<=4.49"
pip install ultralytics
pip install pillow
What's covered in this tutorial¶
This tutorial covers:
Dataset Loading - Loading a street scene dataset from FiftyOne's Dataset Zoo
Hugging Face Integration - Using OWL-ViT for zero-shot detection through FiftyOne's Hugging Face integration
Ultralytics Integration - Implementing YOLO-World for zero-shot detection
Plugin Usage - Exploring the Florence2 plugin for additional zero-shot capabilities
Custom Implementation - Understanding how to implement arbitrary zero-shot detection models in FiftyOne
Each section builds upon the previous ones, demonstrating different approaches to zero-shot detection while highlighting FiftyOne's flexibility in working with various model frameworks.
Zero-Shot Detection¶
Load Dataset¶
Let's load a Dataset from the FiftyOne Dataset Zoo. In this tutorial, we'll use the Quickstart Geo dataset. This is a small Dataset consisting of 500 images from the validation split of the BDD100K dataset, captured in the New York City area, with object detections and GPS timestamps.
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("quickstart-geo")
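If you'd like to browse the samples and their existing ground truth labels before running any models, you can optionally launch the FiftyOne App. This is a minimal sketch; the session variable is just a convenience handle:
# Optional: visually inspect the dataset in the FiftyOne App
session = fo.launch_app(dataset)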
Let's make up a list of classes to detect. Since the Dataset we're working with is from New York City streets, we'll focus on vehicles, traffic infrastructure, and urban elements that we'd expect to see in NYC traffic scenes.
This includes various car types, traffic signals, street furniture, and public transportation.
detection_classes = [
    "yellow cab",
    "sedan",
    "coupe",
    "hatchback",
    "SUV",
    "pickup truck",
    "station wagon",
    "crossover",
    "minivan",
    "green light",
    "red light",
    "illuminated tail lights",
    "illuminated head lights",
    "tow truck",
    "parking meter",
    "traffic barrier",
    "traffic cone",
    "bus stop",
    "storefront",
    "construction vehicle",
    "municipal bus",
    "charter bus"
]
Model Zoo¶
The FiftyOne Model Zoo provides a powerful interface for downloading models and applying them to your FiftyOne datasets.
It provides native access to hundreds of pre-trained models, and it also supports downloading arbitrary public or private models whose definitions are provided via GitHub repositories or URLs.
In fact, the Model Zoo is so flexible that you can natively load certain Hugging Face Transformers models and Ultralytics models for zero-shot object detection as Zoo models via the load_zoo_model method.
Hugging Face Integration¶
FiftyOne integrates with Hugging Face's Transformers library for zero-shot detection models, which allows you to load a Transformers model as a Zoo model.
To load a model from the Hugging Face Hub, set name_or_url=zero-shot-detection-transformer-torch. This specifies that you want to load a zero-shot object detection model from the Hugging Face Transformers library. You can then specify the model via the name_or_path argument, which should be the repository name or model identifier of the model you want to load.
Note: the confidence_thresh parameter is optional and can be used to filter out predictions with confidence scores below the specified threshold. You may need to adjust this value based on the model and dataset you are working with.
import torch
import fiftyone.zoo as foz
device="cuda" if torch.cuda.is_available() else "cpu"
owlvit = foz.load_zoo_model(
    "zero-shot-detection-transformer-torch",
    text_prompt="a photo of a ",  # per the model card
    name_or_path="google/owlvit-base-patch32",  # HF model name or path
    classes=detection_classes,
    device=device,
    confidence_thresh=0.1,  # setting an arbitrarily low threshold
    # install_requirements=True  # uncomment to install the necessary requirements
)
dataset.apply_model(
    owlvit,
    label_field="owlvit_detections",
)
100% |█████████████████| 500/500 [29.5s elapsed, 0s remaining, 17.7 samples/s]
As a refresher, dataset.apply_model() runs inference on the dataset using the parameters provided when we loaded the Zoo model. It creates a label field named "owlvit_detections" on the dataset and populates each Sample with the model's output.
You can examine the results by skipping ahead to an arbitrary Sample as follows:
dataset.skip(42).first()['owlvit_detections']
<Detections: { 'detections': [ <Detection: { 'id': '67e1dae20b9d9cbc6d0ef665', 'attributes': {}, 'tags': [], 'label': 'storefront', 'bounding_box': [ 0.0014444444444444446, -0.0026875, 0.17194444444444443, 0.47613281250000006, ], 'mask': None, 'mask_path': None, 'confidence': 0.13197840750217438, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef666', 'attributes': {}, 'tags': [], 'label': 'storefront', 'bounding_box': [ 0.13490277777777776, 0.26649218750000003, 0.10397222222222224, 0.19666406250000001, ], 'mask': None, 'mask_path': None, 'confidence': 0.15454693138599396, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef667', 'attributes': {}, 'tags': [], 'label': 'storefront', 'bounding_box': [ 0.10261111111111111, 0.091953125, 0.14166666666666666, 0.379328125, ], 'mask': None, 'mask_path': None, 'confidence': 0.11891092360019684, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef668', 'attributes': {}, 'tags': [], 'label': 'municipal bus', 'bounding_box': [ 0.5685138888888889, 0.31115624999999997, 0.15155555555555564, 0.1060546875, ], 'mask': None, 'mask_path': None, 'confidence': 0.13214434683322906, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef669', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.4620833333333333, 0.3072578125, 0.11956944444444449, 0.15671093749999998, ], 'mask': None, 'mask_path': None, 'confidence': 0.29390838742256165, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66a', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.5698333333333333, 0.33166406249999997, 0.11266666666666668, 0.166296875, ], 'mask': None, 'mask_path': None, 'confidence': 0.17897413671016693, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66b', 'attributes': {}, 'tags': [], 'label': 'sedan', 'bounding_box': [ 0.5687222222222222, 0.3703671875, 0.08163888888888884, 0.14845312500000002, ], 'mask': None, 'mask_path': None, 'confidence': 0.2534674406051636, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66c', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.6298055555555555, 0.34385937499999997, 0.19693055555555558, 0.2575, ], 'mask': None, 'mask_path': None, 'confidence': 0.3990810811519623, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66d', 'attributes': {}, 'tags': [], 'label': 'sedan', 'bounding_box': [ 0.5770555555555555, 0.37800781250000004, 0.07523611111111106, 0.14806249999999999, ], 'mask': None, 'mask_path': None, 'confidence': 0.22299253940582275, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66e', 'attributes': {}, 'tags': [], 'label': 'sedan', 'bounding_box': [ 0.5954861111111112, 0.38992187500000003, 0.06872222222222225, 0.15815624999999994, ], 'mask': None, 'mask_path': None, 'confidence': 0.10626339912414551, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66f', 'attributes': {}, 'tags': [], 'label': 'minivan', 'bounding_box': [ 0.7587499999999999, 0.24249218749999998, 0.2410277777777779, 0.41490625000000003, ], 'mask': None, 'mask_path': None, 'confidence': 0.3947419822216034, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef670', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.2708333333333333, 0.477, 0.07577777777777778, 0.051179687500000084, ], 'mask': None, 'mask_path': None, 'confidence': 0.12947803735733032, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef671', 'attributes': {}, 'tags': [], 'label': 'yellow cab', 'bounding_box': [ 
0.2574722222222222, 0.2663515625, 0.3298055555555555, 0.360625, ], 'mask': None, 'mask_path': None, 'confidence': 0.2435181438922882, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef672', 'attributes': {}, 'tags': [], 'label': 'yellow cab', 'bounding_box': [ 0.2605972222222222, 0.220921875, 0.32504166666666673, 0.4124921875, ], 'mask': None, 'mask_path': None, 'confidence': 0.3468368649482727, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef673', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.2707361111111111, 0.5041015625, 0.03811111111111111, 0.04823437500000001, ], 'mask': None, 'mask_path': None, 'confidence': 0.14190329611301422, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef674', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.4896388888888889, 0.4840703125, 0.063625, 0.037249999999999964, ], 'mask': None, 'mask_path': None, 'confidence': 0.1208077222108841, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef675', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.5343333333333333, 0.5111484374999999, 0.030236111111111085, 0.03962500000000002, ], 'mask': None, 'mask_path': None, 'confidence': 0.16234591603279114, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef676', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.5202361111111111, 0.5022734375, 0.05220833333333338, 0.061531249999999996, ], 'mask': None, 'mask_path': None, 'confidence': 0.12089472264051437, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef677', 'attributes': {}, 'tags': [], 'label': 'parking meter', 'bounding_box': [ 0.004472222222222223, 0.18684375, 0.9988055555555555, 0.534171875, ], 'mask': None, 'mask_path': None, 'confidence': 0.11007577180862427, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef678', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ 0.008263888888888888, 0.1459609375, 0.9990833333333332, 0.6118125, ], 'mask': None, 'mask_path': None, 'confidence': 0.12653888761997223, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef679', 'attributes': {}, 'tags': [], 'label': 'tow truck', 'bounding_box': [ 0.6108055555555555, 0.23802343750000002, 0.38940277777777776, 0.412671875, ], 'mask': None, 'mask_path': None, 'confidence': 0.1311836540699005, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef67a', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ -0.00175, 0.5913906250000001, 0.9975972222222222, 0.4046484375, ], 'mask': None, 'mask_path': None, 'confidence': 0.14577694237232208, 'index': None, }>, ], }>
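Because we deliberately set a low confidence_thresh, many of these boxes have low scores. Here is a small, hedged sketch of trimming them with a view; the 0.3 cutoff and the printed count are purely illustrative:
from fiftyone import ViewField as F
# Keep only OWL-ViT predictions whose confidence exceeds 0.3
high_conf_view = dataset.filter_labels(
    "owlvit_detections", F("confidence") > 0.3, only_matches=False
)
# Total number of detections that survive the filter
print(high_conf_view.count("owlvit_detections.detections"))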
Any model that can be run in a Hugging Face pipeline for the zero-shot-object-detection task can be loaded as a Zoo model.
A good first entry point is to simply try it: pass the model name into the name_or_path argument of the load_zoo_model method. If a Hugging Face model is not compatible with the integration, you'll see an error to the effect of:
ValueError: Unrecognized model in <whatever-model-name>
In this case, you will need to run the model manually. All this means is that you need to instantiate the model and its processor, and write some logic to parse the model output into FiftyOne Detections. I'll show you how to do this later in this tutorial.
Ultralytics¶
FiftyOne integrates natively with Ultralytics, so you can load, fine-tune, and run inference with your favorite Ultralytics models on your FiftyOne datasets with just a few lines of code.
Check out the documentation for our Ultralytics integration if you're interested in manually using an Ultralytics model rather than loading it as a Zoo model; a brief sketch of that approach appears at the end of this section.
!pip install ultralytics
import torch
import fiftyone.zoo as foz
device="cuda" if torch.cuda.is_available() else "cpu"
yolo_world = foz.load_zoo_model(
    "yolov8s-world-torch",
    classes=detection_classes,
    device=device,
    confidence_thresh=0.2,
    # install_requirements=True  # uncomment to install the necessary requirements
)
dataset.apply_model(yolo_world, label_field="yolow_detections")
100% |█████████████████| 500/500 [13.9s elapsed, 0s remaining, 36.3 samples/s]
dataset.skip(42).first()['yolow_detections']
<Detections: { 'detections': [ <Detection: { 'id': '67e1dc450b9d9cbc6d0f0f61', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.7529513612389565, 0.24498116970062256, 0.2470119446516037, 0.40709465742111206, ], 'mask': None, 'mask_path': None, 'confidence': 0.5598529577255249, 'index': None, }>, <Detection: { 'id': '67e1dc450b9d9cbc6d0f0f62', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.6284473314881325, 0.34372271597385406, 0.19322185218334198, 0.25756362080574036, ], 'mask': None, 'mask_path': None, 'confidence': 0.5495367646217346, 'index': None, }>, ], }>
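As mentioned above, you can also run an Ultralytics model manually rather than through the Model Zoo. The sketch below assumes the integration accepts a raw Ultralytics model object directly via apply_model (see the integration docs); the yolov8s-world.pt checkpoint and the yolow_manual_detections field name are illustrative:
from ultralytics import YOLO
# Load a YOLO-World checkpoint and restrict it to our prompt classes
yw_model = YOLO("yolov8s-world.pt")
yw_model.set_classes(detection_classes)
# Apply the raw Ultralytics model to the dataset
dataset.apply_model(yw_model, label_field="yolow_manual_detections")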
Plugins¶
You can also run zero-shot detection via FiftyOne Plugins. The example below shows how to use the Florence2 plugin for zero-shot object detection and zero-shot open-vocabulary detection. Begin by downloading the plugin and installing its requirements:
!fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin
!fiftyone plugins requirements @jacobmarks/florence2 --install
Next, instantiate the operator:
import fiftyone.operators as foo
MODEL_PATH = "microsoft/Florence-2-base-ft"
florence2_detection = foo.get_operator("@jacobmarks/florence2/detect_with_florence2")
You should start a delegated service for this Operator. You can do that by opening your terminal and executing the following command:
fiftyone delegated launch
You'll use the await syntax and pass the delegate=True argument when running this plugin from a notebook. This lets the model run in the background without blocking your code or the App. Here's how you can use the plugin for zero-shot object detection:
await florence2_detection(
    dataset,
    model_path=MODEL_PATH,
    detection_type="detection",
    output_field="zero_shot_detections",
    delegate=True
)
<fiftyone.operators.executor.ExecutionResult at 0x7bdb28191fd0>
You'll notice a progress bar in the terminal window where you launched the delegated service incrementing as the model runs inference. You can track the progress there until the run completes.
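You can also check on delegated runs from another terminal; the exact subcommand below is an assumption, so consult the delegated operations docs if it differs for your version:
# List delegated operations and their current status
!fiftyone delegated list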
dataset.first()['zero_shot_detections']
<Detections: { 'detections': [ <Detection: { 'id': '67e1dcd87bdb18075d1d18d4', 'attributes': {}, 'tags': [], 'label': 'fire hydrant', 'bounding_box': [ 0.9385000228881836, 0.793500010172526, 0.059999942779541016, 0.07600004408094618, ], 'mask': None, 'mask_path': None, 'confidence': None, 'index': None, }>, <Detection: { 'id': '67e1dcd87bdb18075d1d18d5', 'attributes': {}, 'tags': [], 'label': 'street light', 'bounding_box': [ 0.23450000286102296, 0.4015000237358941, 0.012999987602233887, 0.010999976264105902, ], 'mask': None, 'mask_path': None, 'confidence': None, 'index': None, }>, <Detection: { 'id': '67e1dcd87bdb18075d1d18d6', 'attributes': {}, 'tags': [], 'label': 'traffic sign', 'bounding_box': [0.28949999809265137, 0.5125, 0.025, 0.03500001695421007], 'mask': None, 'mask_path': None, 'confidence': None, 'index': None, }>, ], }>
You can also use Florence2 for zero-shot open vocabulary detection. Note that the model only supports passing one candidate label for this task:
await florence2_detection(
    dataset,
    model_path=MODEL_PATH,
    detection_type="open_vocabulary_detection",
    text_prompt="pedestrian in intersection",  # the object you want to detect
    output_field="open_detection",
    delegate=True
)
<fiftyone.operators.executor.ExecutionResult at 0x7bdca36436d0>
dataset.skip(42).first()['open_detection']
<Detections: {'detections': []}>
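This particular Sample came back empty. To see how many open-vocabulary matches the prompt produced across the whole Dataset, here is a quick sketch using a count aggregation on the field we populated above:
# Total number of open-vocabulary detections across all samples
print(dataset.count("open_detection.detections"))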
Visit the Florence2 Plugin's GitHub Repo for more detail about using this plugin.
Arbitrary Models¶
If you do not want to use the Model Zoo or the other integrations, you can always run a zero-shot object detection model outside of FiftyOne. To make the results available in FiftyOne, there are a few manual steps you will need to complete. The process of converting predictions to FiftyOne format follows the same general pattern:
Standardize Bounding Box Format
- FiftyOne Detection labels expect bounding boxes in relative coordinates in [0, 1] (that is, normalized by the image width and height)
- The format must be [top-left-x, top-left-y, width, height]
- Most models output absolute coordinates or different formats, so conversion is usually needed
Create Detection Objects
- Each Detection object needs three main components: label (the class name), bounding_box (the normalized coordinates), and confidence (the detection score)
- The individual Detection objects must be grouped into a Detections field for each Sample
Batch Processing Strategy
- Instead of updating samples one by one, collect all detections
- Use dataset.set_values() for efficient batch updates - this is much faster than individual sample.save() calls
The core workflow is:
- Get model predictions
- Convert coordinates to FiftyOne's expected format
- Create Detection objects
- Group them into Detections objects (one per Sample)
- Batch update the Dataset
This pattern remains the same regardless of the model you're using, whether it's from the Hugging Face Hub, Torch Hub, or some brand new SOTA model that you can only use via its GitHub repo. The only part that changes is how you extract and convert the specific model's output into FiftyOne's Detection format.
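To make the coordinate conversion concrete, here is a small helper sketch (the function name is illustrative, not part of FiftyOne) that turns an absolute [x1, y1, x2, y2] box into a FiftyOne Detection; the full OmDet example below inlines the same logic:
import fiftyone as fo
def to_fiftyone_detection(label, box_xyxy, score, img_width, img_height):
    # FiftyOne expects [top-left-x, top-left-y, width, height] in relative [0, 1] coordinates
    x1, y1, x2, y2 = box_xyxy
    rel_box = [
        x1 / img_width,
        y1 / img_height,
        (x2 - x1) / img_width,
        (y2 - y1) / img_height,
    ]
    return fo.Detection(label=label, bounding_box=rel_box, confidence=float(score))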
Here is an example of running the OmDet model from a Hugging Face repo:
import torch
import fiftyone as fo
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection
device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize model and processor
processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
model = OmDetTurboForObjectDetection.from_pretrained(
    "omlab/omdet-turbo-swin-tiny-hf",
    device_map=device,
)
filepaths = dataset.values("filepath")
all_detections = []
for filepath in filepaths:
    # Load and process image
    image = Image.open(filepath)
    height, width = image.size[::-1]  # Get dimensions in same format as target_sizes
    inputs = processor(image, text=detection_classes, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs,
        text_labels=detection_classes,
        target_sizes=[image.size[::-1]],  # Keep model's expected format
        threshold=0.3,
        nms_threshold=0.3,
    )[0]
    scores = results["scores"].cpu().numpy()
    boxes = results["boxes"].cpu().numpy()
    text_labels = results["text_labels"]
    detections = []
    for score, class_name, box in zip(scores, text_labels, boxes):
        x1, y1, x2, y2 = box
        # First normalize all coordinates by their respective dimensions (x/width, y/height)
        x1 = x1 / width
        y1 = y1 / height
        x2 = x2 / width
        y2 = y2 / height
        # Then calculate width and height as differences of normalized coordinates
        w = x2 - x1  # width is right_x - left_x
        h = y2 - y1  # height is bottom_y - top_y
        detection = fo.Detection(
            label=class_name,
            bounding_box=[x1, y1, w, h],
            confidence=float(score)
        )
        detections.append(detection)
    all_detections.append(fo.Detections(detections=detections))
dataset.set_values("omdet_predictions", all_detections)
dataset.skip(42).first()['omdet_predictions']
<Detections: { 'detections': [ <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16cc', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.6251439094543457, 0.3419267018636068, 0.20014638900756831, 0.2533126407199436, ], 'mask': None, 'mask_path': None, 'confidence': 0.697697103023529, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16cd', 'attributes': {}, 'tags': [], 'label': 'sedan', 'bounding_box': [ 0.26125240325927734, 0.20989903344048394, 0.31747684478759763, 0.41080968644883903, ], 'mask': None, 'mask_path': None, 'confidence': 0.48624932765960693, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16ce', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.451490592956543, 0.3090142567952474, 0.12838659286499027, 0.15956696404351128, ], 'mask': None, 'mask_path': None, 'confidence': 0.47672003507614136, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16cf', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.7510287761688232, 0.249177000257704, 0.24851012229919434, 0.39371571011013456, ], 'mask': None, 'mask_path': None, 'confidence': 0.43500420451164246, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d0', 'attributes': {}, 'tags': [], 'label': 'yellow cab', 'bounding_box': [ 0.26125240325927734, 0.20989903344048394, 0.31747684478759763, 0.41080968644883903, ], 'mask': None, 'mask_path': None, 'confidence': 0.4230446517467499, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d1', 'attributes': {}, 'tags': [], 'label': 'minivan', 'bounding_box': [ 0.7510287761688232, 0.249177000257704, 0.24851012229919434, 0.39371571011013456, ], 'mask': None, 'mask_path': None, 'confidence': 0.39243268966674805, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d2', 'attributes': {}, 'tags': [], 'label': 'minivan', 'bounding_box': [ 0.6251439094543457, 0.3419267018636068, 0.20014638900756831, 0.2533126407199436, ], 'mask': None, 'mask_path': None, 'confidence': 0.38366976380348206, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d3', 'attributes': {}, 'tags': [], 'label': 'minivan', 'bounding_box': [ 0.451490592956543, 0.3090142567952474, 0.12838659286499027, 0.15956696404351128, ], 'mask': None, 'mask_path': None, 'confidence': 0.37925446033477783, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d4', 'attributes': {}, 'tags': [], 'label': 'storefront', 'bounding_box': [ 0.13803030252456666, 0.25423321194118925, 0.1100474715232849, 0.21988618638780383, ], 'mask': None, 'mask_path': None, 'confidence': 0.3741093575954437, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d5', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ 0.26125240325927734, 0.20989903344048394, 0.31747684478759763, 0.41080968644883903, ], 'mask': None, 'mask_path': None, 'confidence': 0.37295597791671753, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d6', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.5690805912017822, 0.3717874738905165, 0.0663639545440674, 0.121239980061849, ], 'mask': None, 'mask_path': None, 'confidence': 0.3514840304851532, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d7', 'attributes': {}, 'tags': [], 'label': 'pickup truck', 'bounding_box': [ 0.7510287761688232, 0.249177000257704, 0.24851012229919434, 0.39371571011013456, ], 'mask': None, 'mask_path': None, 'confidence': 0.33661335706710815, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d8', 'attributes': {}, 'tags': 
[], 'label': 'coupe', 'bounding_box': [ 0.26125240325927734, 0.20989903344048394, 0.31747684478759763, 0.41080968644883903, ], 'mask': None, 'mask_path': None, 'confidence': 0.3173862099647522, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d9', 'attributes': {}, 'tags': [], 'label': 'pickup truck', 'bounding_box': [ 0.6251439094543457, 0.3419267018636068, 0.20014638900756831, 0.2533126407199436, ], 'mask': None, 'mask_path': None, 'confidence': 0.3107627332210541, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16da', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ 0.6251439094543457, 0.3419267018636068, 0.20014638900756831, 0.2533126407199436, ], 'mask': None, 'mask_path': None, 'confidence': 0.30744266510009766, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16db', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ 0.7510287761688232, 0.249177000257704, 0.24851012229919434, 0.39371571011013456, ], 'mask': None, 'mask_path': None, 'confidence': 0.3015192151069641, 'index': None, }>, ], }>
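To get a quick sense of what OmDet found across the full Dataset, you can aggregate the predicted labels; this is a small sketch using FiftyOne's count_values aggregation on the field we just populated:
# Distribution of predicted labels across the dataset
print(dataset.count_values("omdet_predictions.detections.label"))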
Summary¶
This tutorial has introduced you to several approaches for performing zero-shot object detection using FiftyOne:
- Using pre-trained models through the Hugging Face integration
- Leveraging Ultralytics' YOLO-World model
- Exploring plugin-based solutions like Florence2
- Implementing custom zero-shot detection models
Next Steps¶
To continue learning, you can:
• Learn more about our integration with Hugging Face
• Check out the Zero-Shot Detection Plugin and learn more about Plugins in general
• Learn more about adding object detections to a Dataset
• Use the Moondream2 Plugin for zero-shot object detection
• Use the Florence2 Plugin for zero-shot object detection
• Learn how to evaluate object detections with FiftyOne
• Learn more in our blog, Zero-Shot Image Classification with Multimodal Models and FiftyOne
Remember that zero-shot detection is a rapidly evolving field - the approaches shown here are just the beginning. FiftyOne's flexible architecture allows you to easily incorporate new models and techniques as they become available.