Who this is for¶
This tutorial is designed for machine learning practitioners who:
- Have basic familiarity with FiftyOne (used it at least once before)
- Are interested in exploring zero-shot object detection without training models
- Want to quickly test different zero-shot detection models on their datasets
Assumed Knowledge¶
Computer Vision Concepts:
- Understanding of object detection and bounding boxes
- Familiarity with confidence scores and model predictions
- Basic knowledge of zero-shot learning concepts
Technical Requirements:
- Intermediate Python programming skills
- Experience with PyTorch and/or Hugging Face
- Ability to work with image datasets and common formats (jpg, png)
FiftyOne Concepts: You should be familiar with basic FiftyOne concepts such as Datasets, Samples, and label fields (for example, Detections)
Time to complete¶
Estimated time: 30-45 minutes
- Setup: 5-10 minutes
- Tutorial: 20-25 minutes
- Experimentation: 10+ minutes
Required packages¶
Make sure you have a virtual environment with FiftyOne already installed. Then install the following packages:
# Install required packages
pip install fiftyone
pip install torch torchvision
pip install "transformers<=4.49"
pip install ultralytics
pip install pillow
What's covered in this tutorial¶
This tutorial covers:
Dataset Loading - Loading a street scene dataset from FiftyOne's Dataset Zoo
Hugging Face Integration - Using OWL-ViT for zero-shot detection through FiftyOne's Hugging Face integration
Ultralytics Integration - Implementing YOLO-World for zero-shot detection
Plugin Usage - Exploring the Florence2 plugin for additional zero-shot capabilities
Custom Implementation - Understanding how to implement arbitrary zero-shot detection models in FiftyOne
Each section builds upon the previous ones, demonstrating different approaches to zero-shot detection while highlighting FiftyOne's flexibility in working with various model frameworks.
Zero-Shot Detection¶
Load Dataset¶
Let's load a Dataset from the FiftyOne Dataset Zoo. In this tutorial, we'll use the Quickstart Geo dataset. This is a small Dataset consisting of 500 images from the validation split of the BDD100K dataset, captured in the New York City area, with object detections and GPS timestamps.
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("quickstart-geo")
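If you'd like to browse the samples and their existing ground truth labels before running any models, you can optionally launch the FiftyOne App. This is a minimal sketch; the session variable is just a convenience handle:
# Optional: visually inspect the dataset in the FiftyOne App
session = fo.launch_app(dataset)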
Let's make up a list of classes to detect. Since the Dataset we're working with is from New York City streets, we'll focus on vehicles, traffic infrastructure, and urban elements that we'd expect to see in NYC traffic scenes.
This includes various car types, traffic signals, street furniture, and public transportation.
detection_classes = [
    "yellow cab",
    "sedan",
    "coupe",
    "hatchback",
    "SUV",
    "pickup truck",
    "station wagon",
    "crossover",
    "minivan",
    "green light",
    "red light",
    "illuminated tail lights",
    "illuminated head lights",
    "tow truck",
    "parking meter",
    "traffic barrier",
    "traffic cone",
    "bus stop",
    "storefront",
    "construction vehicle",
    "municipal bus",
    "charter bus"
]
Model Zoo¶
The FiftyOne Model Zoo provides a powerful interface for downloading models and applying them to your FiftyOne datasets.
It provides native access to hundreds of pre-trained models, and it also supports downloading arbitrary public or private models whose definitions are provided via GitHub repositories or URLs.
In fact, the Model Zoo is so flexible that you can natively load certain Hugging Face Transformers models and Ultralytics models for zero-shot object detection as Zoo models via the load_zoo_model method.
Hugging Face Integration¶
FiftyOne integrates with Hugging Face's Transformers library for zero-shot detection models, which allows you to load a Transformers model as a Zoo model.
To load a model from the Hugging Face Hub, set name_or_url=zero-shot-detection-transformer-torch. This specifies that you want to load a zero-shot object detection model from the Hugging Face Transformers library. You can then specify the model via the name_or_path argument, which should be the repository name or model identifier of the model you want to load.
Note: the confidence_thresh parameter is optional and can be used to filter out predictions with confidence scores below the specified threshold. You may need to adjust this value based on the model and dataset you are working with.
import torch
import fiftyone.zoo as foz
device="cuda" if torch.cuda.is_available() else "cpu"
owlvit = foz.load_zoo_model(
    "zero-shot-detection-transformer-torch",
    text_prompt="a photo of a ",  # per the model card
    name_or_path="google/owlvit-base-patch32",  # HF model name or path
    classes=detection_classes,
    device=device,
    confidence_thresh=0.1,  # setting an arbitrarily low threshold
    # install_requirements=True  # uncomment to install the necessary requirements
)
dataset.apply_model(
    owlvit,
    label_field="owlvit_detections",
)
100% |█████████████████| 500/500 [29.5s elapsed, 0s remaining, 17.7 samples/s]
As a refresher, dataset.apply_model() runs inference on the dataset using the parameters provided when we loaded the Zoo model. It creates a label field named "owlvit_detections" on the dataset and populates each Sample with the model's output.
You can examine the results by skipping ahead to an arbitrary Sample as follows:
dataset.skip(42).first()['owlvit_detections']
<Detections: { 'detections': [ <Detection: { 'id': '67e1dae20b9d9cbc6d0ef665', 'attributes': {}, 'tags': [], 'label': 'storefront', 'bounding_box': [ 0.0014444444444444446, -0.0026875, 0.17194444444444443, 0.47613281250000006, ], 'mask': None, 'mask_path': None, 'confidence': 0.13197840750217438, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef666', 'attributes': {}, 'tags': [], 'label': 'storefront', 'bounding_box': [ 0.13490277777777776, 0.26649218750000003, 0.10397222222222224, 0.19666406250000001, ], 'mask': None, 'mask_path': None, 'confidence': 0.15454693138599396, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef667', 'attributes': {}, 'tags': [], 'label': 'storefront', 'bounding_box': [ 0.10261111111111111, 0.091953125, 0.14166666666666666, 0.379328125, ], 'mask': None, 'mask_path': None, 'confidence': 0.11891092360019684, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef668', 'attributes': {}, 'tags': [], 'label': 'municipal bus', 'bounding_box': [ 0.5685138888888889, 0.31115624999999997, 0.15155555555555564, 0.1060546875, ], 'mask': None, 'mask_path': None, 'confidence': 0.13214434683322906, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef669', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.4620833333333333, 0.3072578125, 0.11956944444444449, 0.15671093749999998, ], 'mask': None, 'mask_path': None, 'confidence': 0.29390838742256165, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66a', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.5698333333333333, 0.33166406249999997, 0.11266666666666668, 0.166296875, ], 'mask': None, 'mask_path': None, 'confidence': 0.17897413671016693, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66b', 'attributes': {}, 'tags': [], 'label': 'sedan', 'bounding_box': [ 0.5687222222222222, 0.3703671875, 0.08163888888888884, 0.14845312500000002, ], 'mask': None, 'mask_path': None, 'confidence': 0.2534674406051636, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66c', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.6298055555555555, 0.34385937499999997, 0.19693055555555558, 0.2575, ], 'mask': None, 'mask_path': None, 'confidence': 0.3990810811519623, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66d', 'attributes': {}, 'tags': [], 'label': 'sedan', 'bounding_box': [ 0.5770555555555555, 0.37800781250000004, 0.07523611111111106, 0.14806249999999999, ], 'mask': None, 'mask_path': None, 'confidence': 0.22299253940582275, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66e', 'attributes': {}, 'tags': [], 'label': 'sedan', 'bounding_box': [ 0.5954861111111112, 0.38992187500000003, 0.06872222222222225, 0.15815624999999994, ], 'mask': None, 'mask_path': None, 'confidence': 0.10626339912414551, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef66f', 'attributes': {}, 'tags': [], 'label': 'minivan', 'bounding_box': [ 0.7587499999999999, 0.24249218749999998, 0.2410277777777779, 0.41490625000000003, ], 'mask': None, 'mask_path': None, 'confidence': 0.3947419822216034, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef670', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.2708333333333333, 0.477, 0.07577777777777778, 0.051179687500000084, ], 'mask': None, 'mask_path': None, 'confidence': 0.12947803735733032, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef671', 'attributes': {}, 'tags': [], 'label': 'yellow cab', 'bounding_box': [ 
0.2574722222222222, 0.2663515625, 0.3298055555555555, 0.360625, ], 'mask': None, 'mask_path': None, 'confidence': 0.2435181438922882, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef672', 'attributes': {}, 'tags': [], 'label': 'yellow cab', 'bounding_box': [ 0.2605972222222222, 0.220921875, 0.32504166666666673, 0.4124921875, ], 'mask': None, 'mask_path': None, 'confidence': 0.3468368649482727, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef673', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.2707361111111111, 0.5041015625, 0.03811111111111111, 0.04823437500000001, ], 'mask': None, 'mask_path': None, 'confidence': 0.14190329611301422, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef674', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.4896388888888889, 0.4840703125, 0.063625, 0.037249999999999964, ], 'mask': None, 'mask_path': None, 'confidence': 0.1208077222108841, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef675', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.5343333333333333, 0.5111484374999999, 0.030236111111111085, 0.03962500000000002, ], 'mask': None, 'mask_path': None, 'confidence': 0.16234591603279114, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef676', 'attributes': {}, 'tags': [], 'label': 'illuminated tail lights', 'bounding_box': [ 0.5202361111111111, 0.5022734375, 0.05220833333333338, 0.061531249999999996, ], 'mask': None, 'mask_path': None, 'confidence': 0.12089472264051437, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef677', 'attributes': {}, 'tags': [], 'label': 'parking meter', 'bounding_box': [ 0.004472222222222223, 0.18684375, 0.9988055555555555, 0.534171875, ], 'mask': None, 'mask_path': None, 'confidence': 0.11007577180862427, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef678', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ 0.008263888888888888, 0.1459609375, 0.9990833333333332, 0.6118125, ], 'mask': None, 'mask_path': None, 'confidence': 0.12653888761997223, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef679', 'attributes': {}, 'tags': [], 'label': 'tow truck', 'bounding_box': [ 0.6108055555555555, 0.23802343750000002, 0.38940277777777776, 0.412671875, ], 'mask': None, 'mask_path': None, 'confidence': 0.1311836540699005, 'index': None, }>, <Detection: { 'id': '67e1dae20b9d9cbc6d0ef67a', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ -0.00175, 0.5913906250000001, 0.9975972222222222, 0.4046484375, ], 'mask': None, 'mask_path': None, 'confidence': 0.14577694237232208, 'index': None, }>, ], }>
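Because we deliberately set a low confidence_thresh, many of these boxes have low scores. Here is a small, hedged sketch of trimming them with a view; the 0.3 cutoff and the printed count are purely illustrative:
from fiftyone import ViewField as F
# Keep only OWL-ViT predictions whose confidence exceeds 0.3
high_conf_view = dataset.filter_labels(
    "owlvit_detections", F("confidence") > 0.3, only_matches=False
)
# Total number of detections that survive the filter
print(high_conf_view.count("owlvit_detections.detections"))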
Any model that can be run in a Hugging Face pipeline for the zero-shot-object-detection task can be loaded as a Zoo model.
A good first entry point is to simply try it: pass the model name into the name_or_path argument of the load_zoo_model method. If a Hugging Face model is not compatible with the integration, you'll see an error to the effect of:
ValueError: Unrecognized model in <whatever-model-name>
In this case, you will need to run the model manually. All this means is that you need to instantiate the model and its processor, and write some logic to parse the model output into FiftyOne Detections. I'll show you how to do this later in this tutorial.
Ultralytics¶
FiftyOne integrates natively with Ultralytics, so you can load, fine-tune, and run inference with your favorite Ultralytics models on your FiftyOne datasets with just a few lines of code.
Check out the documentation for our Ultralytics integration if you're interested in manually using an Ultralytics model rather than loading it as a Zoo model; a brief sketch of that approach appears at the end of this section.
!pip install ultralytics
import torch
import fiftyone.zoo as foz
device="cuda" if torch.cuda.is_available() else "cpu"
yolo_world = foz.load_zoo_model(
    "yolov8s-world-torch",
    classes=detection_classes,
    device=device,
    confidence_thresh=0.2,
    # install_requirements=True  # uncomment to install the necessary requirements
)
dataset.apply_model(yolo_world, label_field="yolow_detections")
100% |█████████████████| 500/500 [13.9s elapsed, 0s remaining, 36.3 samples/s]
dataset.skip(42).first()['yolow_detections']
<Detections: { 'detections': [ <Detection: { 'id': '67e1dc450b9d9cbc6d0f0f61', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.7529513612389565, 0.24498116970062256, 0.2470119446516037, 0.40709465742111206, ], 'mask': None, 'mask_path': None, 'confidence': 0.5598529577255249, 'index': None, }>, <Detection: { 'id': '67e1dc450b9d9cbc6d0f0f62', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.6284473314881325, 0.34372271597385406, 0.19322185218334198, 0.25756362080574036, ], 'mask': None, 'mask_path': None, 'confidence': 0.5495367646217346, 'index': None, }>, ], }>
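As mentioned above, you can also run an Ultralytics model manually rather than through the Model Zoo. The sketch below assumes the integration accepts a raw Ultralytics model object directly via apply_model (see the integration docs); the yolov8s-world.pt checkpoint and the yolow_manual_detections field name are illustrative:
from ultralytics import YOLO
# Load a YOLO-World checkpoint and restrict it to our prompt classes
yw_model = YOLO("yolov8s-world.pt")
yw_model.set_classes(detection_classes)
# Apply the raw Ultralytics model to the dataset
dataset.apply_model(yw_model, label_field="yolow_manual_detections")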
Plugins¶
You can also run zero-shot detection via FiftyOne Plugins. The example below shows how to use the Florence2 plugin for zero-shot object detection and zero-shot open-vocabulary detection. Begin by downloading the plugin and installing its requirements:
!fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin
!fiftyone plugins requirements @jacobmarks/florence2 --install
Next, instantiate the operator:
import fiftyone.operators as foo
MODEL_PATH = "microsoft/Florence-2-base-ft"
florence2_detection = foo.get_operator("@jacobmarks/florence2/detect_with_florence2")
You should start a delegated service for this Operator. You can do that by opening your terminal and executing the following command:
fiftyone delegated launch
You'll use the await syntax and pass the delegate=True argument when running this plugin from a notebook. This lets the model run in the background without blocking your code or the App. Here's how you can use the plugin for zero-shot object detection:
await florence2_detection(
    dataset,
    model_path=MODEL_PATH,
    detection_type="detection",
    output_field="zero_shot_detections",
    delegate=True
)
<fiftyone.operators.executor.ExecutionResult at 0x7bdb28191fd0>
You'll notice a progress bar in the terminal window where you launched the delegated service incrementing as the model runs inference. You can track the progress there until the run completes.
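You can also check on delegated runs from another terminal; the exact subcommand below is an assumption, so consult the delegated operations docs if it differs for your version:
# List delegated operations and their current status
!fiftyone delegated list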
dataset.first()['zero_shot_detections']
<Detections: { 'detections': [ <Detection: { 'id': '67e1dcd87bdb18075d1d18d4', 'attributes': {}, 'tags': [], 'label': 'fire hydrant', 'bounding_box': [ 0.9385000228881836, 0.793500010172526, 0.059999942779541016, 0.07600004408094618, ], 'mask': None, 'mask_path': None, 'confidence': None, 'index': None, }>, <Detection: { 'id': '67e1dcd87bdb18075d1d18d5', 'attributes': {}, 'tags': [], 'label': 'street light', 'bounding_box': [ 0.23450000286102296, 0.4015000237358941, 0.012999987602233887, 0.010999976264105902, ], 'mask': None, 'mask_path': None, 'confidence': None, 'index': None, }>, <Detection: { 'id': '67e1dcd87bdb18075d1d18d6', 'attributes': {}, 'tags': [], 'label': 'traffic sign', 'bounding_box': [0.28949999809265137, 0.5125, 0.025, 0.03500001695421007], 'mask': None, 'mask_path': None, 'confidence': None, 'index': None, }>, ], }>
You can also use Florence2 for zero-shot open vocabulary detection. Note that the model only supports passing one candidate label for this task:
await florence2_detection(
    dataset,
    model_path=MODEL_PATH,
    detection_type="open_vocabulary_detection",
    text_prompt="pedestrian in intersection",  # the object you want to detect
    output_field="open_detection",
    delegate=True
)
<fiftyone.operators.executor.ExecutionResult at 0x7bdca36436d0>
dataset.skip(42).first()['open_detection']
<Detections: {'detections': []}>
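This particular Sample came back empty. To see how many open-vocabulary matches the prompt produced across the whole Dataset, here is a quick sketch using a count aggregation on the field we populated above:
# Total number of open-vocabulary detections across all samples
print(dataset.count("open_detection.detections"))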
Visit the Florence2 Plugin's GitHub Repo for more detail about using this plugin.
Arbitrary Models¶
If you do not want to use the Model Zoo or the other integrations, you can always run a zero-shot object detection model outside of FiftyOne. To make the results available in FiftyOne, there are a few manual steps you will need to complete. The process of converting predictions to FiftyOne format follows the same general pattern:
Standardize Bounding Box Format
- FiftyOne Detection labels expect bounding boxes in relative coordinates in [0, 1] (that is, normalized by the image width and height)
- The format must be [top-left-x, top-left-y, width, height]
- Most models output absolute coordinates or different formats, so conversion is usually needed
Create Detection Objects
- Each Detection object needs three main components: label (the class name), bounding_box (the normalized coordinates), and confidence (the detection score)
- The individual Detection objects must be grouped into a Detections field for each Sample
Batch Processing Strategy
- Instead of updating samples one by one, collect all detections
- Use dataset.set_values() for efficient batch updates - this is much faster than individual sample.save() calls
The core workflow is:
- Get model predictions
- Convert coordinates to FiftyOne's expected format
- Create Detection objects
- Group them into Detections objects (one per Sample)
- Batch update the Dataset
This pattern remains the same regardless of the model you're using, whether it's from the Hugging Face Hub, Torch Hub, or some brand new SOTA model that you can only use via its GitHub repo. The only part that changes is how you extract and convert the specific model's output into FiftyOne's Detection format.
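To make the coordinate conversion concrete, here is a small helper sketch (the function name is illustrative, not part of FiftyOne) that turns an absolute [x1, y1, x2, y2] box into a FiftyOne Detection; the full OmDet example below inlines the same logic:
import fiftyone as fo
def to_fiftyone_detection(label, box_xyxy, score, img_width, img_height):
    # FiftyOne expects [top-left-x, top-left-y, width, height] in relative [0, 1] coordinates
    x1, y1, x2, y2 = box_xyxy
    rel_box = [
        x1 / img_width,
        y1 / img_height,
        (x2 - x1) / img_width,
        (y2 - y1) / img_height,
    ]
    return fo.Detection(label=label, bounding_box=rel_box, confidence=float(score))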
Here is an example of running the OmDet model from a Hugging Face repo:
import torch
import fiftyone as fo
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection
device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize model and processor
processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
model = OmDetTurboForObjectDetection.from_pretrained(
    "omlab/omdet-turbo-swin-tiny-hf",
    device_map=device,
)
filepaths = dataset.values("filepath")
all_detections = []
for filepath in filepaths:
    # Load and process image
    image = Image.open(filepath)
    height, width = image.size[::-1]  # Get dimensions in same format as target_sizes
    inputs = processor(image, text=detection_classes, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs,
        text_labels=detection_classes,
        target_sizes=[image.size[::-1]],  # Keep model's expected format
        threshold=0.3,
        nms_threshold=0.3,
    )[0]
    scores = results["scores"].cpu().numpy()
    boxes = results["boxes"].cpu().numpy()
    text_labels = results["text_labels"]
    detections = []
    for score, class_name, box in zip(scores, text_labels, boxes):
        x1, y1, x2, y2 = box
        # First normalize all coordinates by their respective dimensions (x/width, y/height)
        x1 = x1 / width
        y1 = y1 / height
        x2 = x2 / width
        y2 = y2 / height
        # Then calculate width and height as differences of normalized coordinates
        w = x2 - x1  # width is right_x - left_x
        h = y2 - y1  # height is bottom_y - top_y
        detection = fo.Detection(
            label=class_name,
            bounding_box=[x1, y1, w, h],
            confidence=float(score)
        )
        detections.append(detection)
    all_detections.append(fo.Detections(detections=detections))
dataset.set_values("omdet_predictions", all_detections)
dataset.skip(42).first()['omdet_predictions']
<Detections: { 'detections': [ <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16cc', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.6251439094543457, 0.3419267018636068, 0.20014638900756831, 0.2533126407199436, ], 'mask': None, 'mask_path': None, 'confidence': 0.697697103023529, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16cd', 'attributes': {}, 'tags': [], 'label': 'sedan', 'bounding_box': [ 0.26125240325927734, 0.20989903344048394, 0.31747684478759763, 0.41080968644883903, ], 'mask': None, 'mask_path': None, 'confidence': 0.48624932765960693, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16ce', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.451490592956543, 0.3090142567952474, 0.12838659286499027, 0.15956696404351128, ], 'mask': None, 'mask_path': None, 'confidence': 0.47672003507614136, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16cf', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.7510287761688232, 0.249177000257704, 0.24851012229919434, 0.39371571011013456, ], 'mask': None, 'mask_path': None, 'confidence': 0.43500420451164246, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d0', 'attributes': {}, 'tags': [], 'label': 'yellow cab', 'bounding_box': [ 0.26125240325927734, 0.20989903344048394, 0.31747684478759763, 0.41080968644883903, ], 'mask': None, 'mask_path': None, 'confidence': 0.4230446517467499, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d1', 'attributes': {}, 'tags': [], 'label': 'minivan', 'bounding_box': [ 0.7510287761688232, 0.249177000257704, 0.24851012229919434, 0.39371571011013456, ], 'mask': None, 'mask_path': None, 'confidence': 0.39243268966674805, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d2', 'attributes': {}, 'tags': [], 'label': 'minivan', 'bounding_box': [ 0.6251439094543457, 0.3419267018636068, 0.20014638900756831, 0.2533126407199436, ], 'mask': None, 'mask_path': None, 'confidence': 0.38366976380348206, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d3', 'attributes': {}, 'tags': [], 'label': 'minivan', 'bounding_box': [ 0.451490592956543, 0.3090142567952474, 0.12838659286499027, 0.15956696404351128, ], 'mask': None, 'mask_path': None, 'confidence': 0.37925446033477783, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d4', 'attributes': {}, 'tags': [], 'label': 'storefront', 'bounding_box': [ 0.13803030252456666, 0.25423321194118925, 0.1100474715232849, 0.21988618638780383, ], 'mask': None, 'mask_path': None, 'confidence': 0.3741093575954437, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d5', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ 0.26125240325927734, 0.20989903344048394, 0.31747684478759763, 0.41080968644883903, ], 'mask': None, 'mask_path': None, 'confidence': 0.37295597791671753, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d6', 'attributes': {}, 'tags': [], 'label': 'SUV', 'bounding_box': [ 0.5690805912017822, 0.3717874738905165, 0.0663639545440674, 0.121239980061849, ], 'mask': None, 'mask_path': None, 'confidence': 0.3514840304851532, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d7', 'attributes': {}, 'tags': [], 'label': 'pickup truck', 'bounding_box': [ 0.7510287761688232, 0.249177000257704, 0.24851012229919434, 0.39371571011013456, ], 'mask': None, 'mask_path': None, 'confidence': 0.33661335706710815, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d8', 'attributes': {}, 'tags': 
[], 'label': 'coupe', 'bounding_box': [ 0.26125240325927734, 0.20989903344048394, 0.31747684478759763, 0.41080968644883903, ], 'mask': None, 'mask_path': None, 'confidence': 0.3173862099647522, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16d9', 'attributes': {}, 'tags': [], 'label': 'pickup truck', 'bounding_box': [ 0.6251439094543457, 0.3419267018636068, 0.20014638900756831, 0.2533126407199436, ], 'mask': None, 'mask_path': None, 'confidence': 0.3107627332210541, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16da', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ 0.6251439094543457, 0.3419267018636068, 0.20014638900756831, 0.2533126407199436, ], 'mask': None, 'mask_path': None, 'confidence': 0.30744266510009766, 'index': None, }>, <Detection: { 'id': '67e1de7e0b9d9cbc6d0f16db', 'attributes': {}, 'tags': [], 'label': 'station wagon', 'bounding_box': [ 0.7510287761688232, 0.249177000257704, 0.24851012229919434, 0.39371571011013456, ], 'mask': None, 'mask_path': None, 'confidence': 0.3015192151069641, 'index': None, }>, ], }>
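To get a quick sense of what OmDet found across the full Dataset, you can aggregate the predicted labels; this is a small sketch using FiftyOne's count_values aggregation on the field we just populated:
# Distribution of predicted labels across the dataset
print(dataset.count_values("omdet_predictions.detections.label"))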
Summary¶
This tutorial has introduced you to several approaches for performing zero-shot object detection using FiftyOne:
- Using pre-trained models through the Hugging Face integration
- Leveraging Ultralytics' YOLO-World model
- Exploring plugin-based solutions like Florence2
- Implementing custom zero-shot detection models
Next Steps¶
To continue learning, you can:
• Learn more about our integration with Hugging Face
• Check out the Zero-Shot Detection Plugin and learn more about Plugins in general
• Learn more about adding object detections to a Dataset
• Use the Moondream2 Plugin for zero-shot object detection
• Use the Florence2 Plugin for zero-shot object detection
• Learn how to evaluate object detections with FiftyOne
• Learn more in our blog, Zero-Shot Image Classification with Multimodal Models and FiftyOne
Remember that zero-shot detection is a rapidly evolving field - the approaches shown here are just the beginning. FiftyOne's flexible architecture allows you to easily incorporate new models and techniques as they become available.