Who this is for¶
This notebook is designed for:
Computer vision engineers with basic FiftyOne experience (can load datasets and use the App)
Practitioners interested in zero-shot computer vision approaches who may be new to segmentation tasks
Users looking to implement quick segmentation solutions without training custom models or creating labeled datasets
Assumed Knowledge¶
Computer Vision Concepts¶
Basic understanding of image segmentation (semantic, instance)
Familiarity with vision-language models and prompting
Understanding of coordinate systems in images
Technical Requirements¶
Intermediate Python programming skills
Experience with Jupyter notebooks
Basic understanding of PyTorch (for model usage)
FiftyOne Concepts¶
You should be familiar with:
Loading datasets and working with Samples and their fields
Using the FiftyOne Dataset Zoo
The basics of the FiftyOne App
Time to Complete¶
~45-60 minutes
Required Packages¶
It's recommended to use a virtual environment with FiftyOne already installed. Additionally, you'll need:
# Install Florence2 plugin requirements
fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin
fiftyone plugins requirements @jacobmarks/florence2 --install
# Install Moondream2 plugin requirements
fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin
fiftyone plugins requirements @harpreetsahota/moondream2 --install
# Install SAM2
pip install "git+https://github.com/facebookresearch/sam2.git#egg=sam-2"
# Install FastSAM
pip install ultralytics
Content Overview¶
Zero-Shot Segmentation Introduction: Understanding the basics of zero-shot segmentation and its applications
Phrase Grounding Segmentation: Using Florence2 to segment images based on text descriptions
Moondream + SAM2 Integration: Combining automatic keypoint detection with advanced segmentation
FastSAM Implementation: Using point-based prompts for quick and efficient segmentation
Zero-Shot Segmentation¶
Zero-shot segmentation is a computer vision task that aims to segment objects or regions in images without any training samples for those specific categories. It enables models to perform instance, semantic, or panoptic segmentation for novel categories by transferring visual knowledge learned from seen categories to unseen ones. Prompt-based zero-shot segmentation uses prompts to guide the segmentation process at test time without requiring retraining for new categories. This approach allows a single trained model to handle various segmentation tasks dynamically.
Types of Prompts¶
Text Prompts
- Free-text descriptions that specify what to segment in an image
- The model uses pre-trained knowledge of text-image relationships to identify and segment the described objects
Image Prompts
- Visual examples that show what to segment
- Particularly useful when the target is difficult to describe in words
- Can be a reference image containing the object of interest
- The model compares the visual features between the prompt image and the target image to identify similar regions
Hybrid Approaches
- Some systems can accept either text or image prompts for the same model
- CLIPSeg is an example of a model that works with both text and image prompts by adding a decoder to CLIP
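To make this concrete, here is a minimal sketch of text-prompted zero-shot segmentation with CLIPSeg via Hugging Face transformers. This model is not used in the rest of this notebook, and the image path is a placeholder:
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from PIL import Image
import torch

# CLIPSeg adds a lightweight decoder on top of CLIP
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("image.jpg")  # placeholder path
prompts = ["a cat", "a wooden floor"]  # free-text prompts

inputs = processor(
    text=prompts,
    images=[image] * len(prompts),
    padding="max_length",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

# One low-resolution logit map per prompt; sigmoid yields per-pixel scores
masks = torch.sigmoid(outputs.logits)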
Let's begin by downloading a dataset from the FiftyOne Dataset Zoo.¶
You'll notice I have passed several arguments to the load_zoo_dataset function:
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset(
"quickstart",
max_samples=25,
shuffle=True,
dataset_name="mini_quickstart",
overwrite=True,
)
Overwriting existing directory '/home/harpreet/fiftyone/quickstart' Downloading dataset to '/home/harpreet/fiftyone/quickstart' Downloading dataset... 100% |████| 187.5Mb/187.5Mb [908.5ms elapsed, 0s remaining, 206.4Mb/s] Extracting dataset... Parsing dataset metadata Found 200 samples Dataset info written to '/home/harpreet/fiftyone/quickstart/info.json' Loading 'quickstart' 100% |███████████████████| 25/25 [263.8ms elapsed, 0s remaining, 94.8 samples/s] Dataset 'mini_quickstart' created
We'll need metadata, such as each Sample's height and width, later on. We can use the compute_metadata method of the Dataset to accomplish this:
dataset.compute_metadata()
Computing metadata... 100% |███████████████████| 25/25 [12.7ms elapsed, 0s remaining, 2.0K samples/s]
Here's what the metadata for the first Sample looks like:
dataset.first().metadata
<ImageMetadata: { 'size_bytes': 157534, 'mime_type': 'image/jpeg', 'width': 640, 'height': 489, 'num_channels': 3, }>
Phrase Grounding Segmentation¶
Phrase grounding segmentation extends traditional phrase grounding by not only localizing objects mentioned in text but also generating pixel-level segmentation masks for those objects. While phrase grounding typically produces bounding boxes around regions corresponding to textual phrases, phrase grounding segmentation aims to create fine-grained segmentation masks that precisely delineate the boundaries of the referenced objects.
This approach enables more precise visual understanding by:
- Associating specific words or phrases with their corresponding image regions
- Generating pixel-accurate segmentation masks rather than just bounding boxes
- Creating a more detailed alignment between language and visual content
Plugins¶
FiftyOne plugins are powerful extensions that allow users to customize and enhance the functionality of the FiftyOne App.
Plugins can be written in Python, JavaScript, or a combination of both, enabling users to add new features, create integrations with other tools and APIs, render custom panels, and add custom actions to menus. They are composed of panels, operators, and components, which together allow for building full-featured interactive data applications tailored to specific use cases. Plugins can range from simple actions like adding a checkbox to complex workflows such as requesting annotations from a configurable backend. This extensibility makes FiftyOne highly adaptable to various computer vision tasks and workflows, limited only by the user's imagination.
We'll use the Plugin framework via the FiftyOne SDK, and you can refer to the docs on using the Plugin framework in the FiftyOne App.
Florence2 Plugin¶
The Florence2 Plugin integrates Microsoft's Florence2 Vision-Language Model with FiftyOne datasets, offering several powerful computer vision capabilities.
One of these tasks is referring segmentation, which allows you to segment specific regions in an image based on natural language descriptions. This is particularly useful when you need to segment specific parts of an image based on textual descriptions, allowing for region identification using natural language. It can be used in two ways:
• Using a direct expression through the expression parameter
• Using an existing expression field in your dataset via the expression_field parameter
Note: Referring segmentation is a hard task in Visual AI, and as powerful as the Florence2 model is, the results are not always the best. It's a good idea to be precise with your open-vocabulary prompt and to use the shortest caption possible for each Sample.
Let's start by setting an environment variable, downloading the plugin, and installing its requirements.
# set environment variable
import os
os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'
# download the plugin
!fiftyone plugins download https://github.com/jacobmarks/fiftyone_florence2_plugin
# install requirements for the plugin
!fiftyone plugins requirements @jacobmarks/florence2 --install
With the plugin installed, you can instantiate the Operator like this:
import fiftyone.operators as foo
MODEL_PATH = "microsoft/Florence-2-large-ft"
florence2_referring_segmentation = foo.get_operator("@jacobmarks/florence2/referring_expression_segmentation_with_florence2")
To use the operator, you will need to start a Delegated service by opening your terminal and running the following command:
fiftyone delegated launch
Then, you can call the Operator by running the following cell:
await florence2_referring_segmentation(
dataset,
model_path=MODEL_PATH,
expression="human",
output_field="open_expression_segmentations",
delegate=True
)
You can examine the results of the first Sample like so:
dataset.first()['open_expression_segmentations']
<Polylines: { 'polylines': [ <Polyline: { 'id': '67eac88e0c94125554ed2dd7', 'attributes': {}, 'tags': [], 'label': 'object_1', 'points': [ [ [0.184499990940094, 0.18949999350955393], [0.28250000476837156, 0.08650000012481628], [0.4114999771118164, 0.03249999984397966], [0.5014999866485595, 0.013500000070209153], [0.6154999732971191, 0.03950000053046915], [0.7244999885559082, 0.08650000012481628], [0.8255000114440918, 0.1804999915124936], [0.8885000228881836, 0.29449999600587934], [0.9215000152587891, 0.4465000019970603], [0.9184999465942383, 0.5855000189720732], [0.8614999771118164, 0.7404999840235174], [0.7494999885559082, 0.8744999860205777], [0.6014999866485595, 0.9414999870191079], [0.46050000190734863, 0.9454999740382157], [0.3534999847412109, 0.9054999790308665], [0.3534999847412109, 0.9054999790308665], [0.3504999876022339, 0.9034999855213126], [0.3504999876022339, 0.9014999920117587], [0.3534999847412109, 0.8964999770338062], [0.3534999847412109, 0.8944999835242523], [0.3534999847412109, 0.8904999965051444], [0.3534999847412109, 0.885499981527192], [0.3534999847412109, 0.8814999945080841], [0.3534999847412109, 0.8795000009985302], [0.3534999847412109, 0.8744999860205777], [0.3534999847412109, 0.8704999990014698], [0.3534999847412109, 0.8654999840235174], [0.3534999847412109, 0.8614999970044095], [0.3534999847412109, 0.856499982026457], [0.3534999847412109, 0.8524999950073492], [0.3534999847412109, 0.8474999800293967], [0.3534999847412109, 0.8434999930102889], [0.3534999847412109, 0.8384999780323364], [0.3534999847412109, 0.8344999910132286], [0.3534999847412109, 0.829499976035276], [0.3534999847412109, 0.8234999955066142], [0.3534999847412109, 0.8184999805286618], [0.3534999847412109, 0.8144999935095539], [0.3534999847412109, 0.8094999785316015], [0.3534999847412109, 0.8054999915124936], [0.3534999847412109, 0.8004999765345412], [0.3534999847412109, 0.7964999895154333], [0.3534999847412109, 0.7914999745374808], [0.3534999847412109, 0.7874999875183729], [0.3534999847412109, 0.7824999725404205], [0.3534999847412109, 0.7764999920117587], [0.3534999847412109, 0.7714999770338062], [0.3534999847412109, 0.7674999900146984], [0.3534999847412109, 0.7624999750367459], [0.3534999847412109, 0.7584999880176381], [0.3534999847412109, 0.7534999730396856], [0.3534999847412109, 0.7494999860205777], [0.3534999847412109, 0.7444999710426252], [0.3534999847412109, 0.7404999840235174], [0.3534999847412109, 0.7354999690455649], [0.3534999847412109, 0.731499982026457], [0.3534999847412109, 0.7294999885169031], [0.3534999847412109, 0.7244999735389507], [0.3534999847412109, 0.7204999865198428], [0.3534999847412109, 0.7154999715418904], [0.3534999847412109, 0.7114999845227825], [0.3534999847412109, 0.70649996954483], [0.3534999847412109, 0.7024999825257221], [0.3534999847412109, 0.6974999675477697], [0.3534999847412109, 0.6934999805286618], [0.3534999847412109, 0.6914999870191079], [0.3534999847412109, 0.6864999720411554], [0.3534999847412109, 0.6824999850220476], [0.3534999847412109, 0.6804999915124936], [0.3534999847412109, 0.6754999765345412], [0.3534999847412109, 0.6714999895154333], [0.3534999847412109, 0.6664999745374808], [0.3534999847412109, 0.6624999875183729], [0.3534999847412109, 0.6574999725404205], [0.3534999847412109, 0.6534999855213126], [0.3534999847412109, 0.6514999920117587], [0.3534999847412109, 0.6464999770338062], [0.3534999847412109, 0.6444999835242523], [0.3534999847412109, 0.6424999900146984], [0.3534999847412109, 0.6404999965051444], [0.3534999847412109, 0.635499981527192], 
[0.3534999847412109, 0.6334999880176381], [0.3534999847412109, 0.6314999945080841], [0.3534999847412109, 0.6264999795301317], [0.3534999847412109, 0.6244999860205777], [0.3534999847412109, 0.6224999925110237], [0.3534999847412109, 0.6174999775330713], [0.3534999847412109, 0.6134999905139634], [0.3534999847412109, 0.608499975536011], [0.3534999847412109, 0.6044999885169031], [0.3534999847412109, 0.6024999950073492], [0.3534999847412109, 0.5974999800293967], [0.3534999847412109, 0.5954999865198428], [0.3534999847412109, 0.5934999930102889], [0.3534999847412109, 0.591499999500735], [0.3534999847412109, 0.5884999780323364], [0.3534999847412109, 0.5864999845227825], [0.3534999847412109, 0.5844999910132286], [0.3534999847412109, 0.5824999975036745], [0.3534999847412109, 0.579499976035276], [0.3534999847412109, 0.5774999825257221], [0.3534999847412109, 0.5754999890161682], [0.3534999847412109, 0.5734999955066142], [0.3534999847412109, 0.5715000019970603], [0.3534999847412109, 0.5684999805286618], [0.3534999847412109, 0.5664999870191079], [0.3534999847412109, 0.5644999935095539], [0.3534999847412109, 0.5625], [0.3534999847412109, 0.5605000064904461], [0.3534999847412109, 0.5574999850220476], [0.3534999847412109, 0.5554999915124936], [0.3534999847412109, 0.5534999980029397], [0.3534999847412109, 0.5515000044933858], [0.3534999847412109, 0.5484999830249873], [0.3534999847412109, 0.5464999895154333], [0.3534999847412109, 0.5444999960058794], [0.3534999847412109, 0.5425000024963255], [0.3534999847412109, 0.5394999810279268], [0.3534999847412109, 0.5374999875183729], [0.3534999847412109, 0.535499994008819], [0.3534999847412109, 0.533500000499265], [0.3534999847412109, 0.5315000069897111], [0.3534999847412109, 0.5284999855213126], [0.3534999847412109, 0.5264999920117587], [0.3534999847412109, 0.5244999985022047], [0.3534999847412109, 0.5225000049926508], [0.3534999847412109, 0.5204999802790292], [0.3534999847412109, 0.5174999900146984], [0.3534999847412109, 0.5154999965051444], [0.3534999847412109, 0.5135000029955905], [0.3534999847412109, 0.5114999782819689], [0.3534999847412109, 0.5084999880176381], [0.3534999847412109, 0.5064999945080841], [0.3534999847412109, 0.5045000009985302], [0.3534999847412109, 0.5025000074889763], [0.3534999847412109, 0.4994999860205777], [0.3534999847412109, 0.4974999925110238], [0.3534999847412109, 0.49549999900146985], [0.3534999847412109, 0.4935000054919159], [0.3534999847412109, 0.4914999807782944], [0.3534999847412109, 0.4884999905139634], [0.3534999847412109, 0.4864999970044095], [0.3534999847412109, 0.48450000349485556], [0.3534999847412109, 0.482499978781234], [0.3534999847412109, 0.4804999852716801], [0.3534999847412109, 0.47849999176212615], [0.3534999847412109, 0.47550000149779525], [0.3534999847412109, 0.4735000079882413], [0.3534999847412109, 0.4714999832746198], [0.3534999847412109, 0.46949998976506585], [0.3534999847412109, 0.46749999625551186], [0.3534999847412109, 0.46550000274595793], [0.3534999847412109, 0.4634999780323364], [0.3534999847412109, 0.46149998452278246], [0.3534999847412109, 0.4594999910132285], [0.3534999847412109, 0.4574999975036746], [0.3534999847412109, 0.4545000072393437], [0.3534999847412109, 0.45249998252572216], [0.3534999847412109, 0.4504999890161682], [0.3534999847412109, 0.4484999955066143], [0.3534999847412109, 0.4465000019970603], [0.3534999847412109, 0.44450000848750637], [0.3534999847412109, 0.44249998377388483], [0.3534999847412109, 0.43949999350955393], [0.3534999847412109, 0.4375], [0.3534999847412109, 0.43550000649044607], 
[0.3534999847412109, 0.43349998177682453], [0.3534999847412109, 0.4314999882672706], [0.3534999847412109, 0.42949999475771666], [0.3534999847412109, 0.42750000124816273], [0.3534999847412109, 0.42550000773860874], [0.3534999847412109, 0.4234999830249872], [0.3534999847412109, 0.42149998951543327], [0.3534999847412109, 0.4184999992511024], [0.3534999847412109, 0.41650000574154844], [0.3534999847412109, 0.4144999810279269], [0.3534999847412109, 0.41249998751837297], [0.3534999847412109, 0.41049999400881904], [0.3534999847412109, 0.4085000004992651], [0.3534999847412109, 0.40650000698971117], [0.3534999847412109, 0.4044999822760896], [0.3534999847412109, 0.40249998876653564], [0.3534999847412109, 0.39949999850220475], [0.3534999847412109, 0.3975000049926508], [0.3534999847412109, 0.3954999802790293], [0.3534999847412109, 0.39349998676947534], [0.3534999847412109, 0.3914999932599214], [0.3534999847412109, 0.3894999997503675], [0.3534999847412109, 0.38750000624081354], [0.3534999847412109, 0.38549998152719195], [0.3534999847412109, 0.383499988017638], [0.3534999847412109, 0.3814999945080841], [0.3534999847412109, 0.37950000099853015], [0.3534999847412109, 0.3775000074889762], [0.3534999847412109, 0.3754999827753547], [0.3534999847412109, 0.3724999925110238], [0.3534999847412109, 0.37049999900146985], [0.3534999847412109, 0.3685000054919159], [0.3534999847412109, 0.3664999807782944], [0.3534999847412109, 0.3644999872687404], [0.3534999847412109, 0.36249999375918646], [0.3534999847412109, 0.3605000002496325], [0.3534999847412109, 0.3585000067400786], [0.3534999847412109, 0.35649998202645705], [0.3534999847412109, 0.35349999176212615], [0.3534999847412109, 0.3514999982525722], [0.3534999847412109, 0.3495000047430183], [0.3534999847412109, 0.34749998002939675], [0.3534999847412109, 0.3454999865198428], [0.3534999847412109, 0.34349999301028883], [0.3534999847412109, 0.3414999995007349], [0.3534999847412109, 0.33950000599118096], [0.3534999847412109, 0.33649998452278246], [0.3534999847412109, 0.3344999910132285], [0.3534999847412109, 0.3324999975036746], [0.3534999847412109, 0.33050000399412066], [0.3534999847412109, 0.32850001048456673], [0.3534999847412109, 0.3264999857709452], [0.3534999847412109, 0.32449999226139126], [0.3534999847412109, 0.32249999875183727], [0.3534999847412109, 0.32050000524228334], [0.3534999847412109, 0.31749998377388483], [0.3534999847412109, 0.3154999902643309], [0.3534999847412109, 0.31349999675477697], [0.3534999847412109, 0.31150000324522303], [0.3534999847412109, 0.3095000097356691], [0.3534999847412109, 0.30749998502204756], [0.3534999847412109, 0.30549999151249363], [0.3534999847412109, 0.3034999980029397], [0.3534999847412109, 0.3015000044933857], [0.3534999847412109, 0.2984999830249872], [0.3534999847412109, 0.29649998951543327], [0.3534999847412109, 0.29449999600587934], [0.3534999847412109, 0.2925000024963254], [0.3534999847412109, 0.2905000089867715], [0.3534999847412109, 0.28849998427314993], [0.3534999847412109, 0.286499990763596], [0.3534999847412109, 0.28449999725404207], [0.3534999847412109, 0.28150000698971117], [0.3534999847412109, 0.2794999822760896], [0.3534999847412109, 0.27749998876653564], [0.3534999847412109, 0.2754999952569817], [0.3534999847412109, 0.2735000017474278], [0.3534999847412109, 0.27150000823787385], [0.3534999847412109, 0.2694999835242523], [0.3534999847412109, 0.2674999900146984], [0.3534999847412109, 0.26549999650514444], [0.3534999847412109, 0.26250000624081354], [0.3534999847412109, 0.2604999971292258], [0.3534999847412109, 
0.25850000361967185], [0.3534999847412109, 0.2564999945080841], [0.3534999847412109, 0.25450000099853015], [0.3534999847412109, 0.25249999188694244], [0.3534999847412109, 0.2504999983773885], [0.3534999847412109, 0.24849998926580075], [0.3534999847412109, 0.24549999900146985], [0.3534999847412109, 0.24349998988988208], [0.3534999847412109, 0.24149999638032815], [0.3534999847412109, 0.23950000287077422], [0.3534999847412109, 0.23749999375918648], [0.3534999847412109, 0.23550000024963255], [0.3534999847412109, 0.2334999911380448], [0.3534999847412109, 0.23149999762849086], [0.3534999847412109, 0.22950000411893692], [0.3534999847412109, 0.22649999825257222], [0.3534999847412109, 0.22449998914098446], [0.3534999847412109, 0.22249999563143052], [0.3534999847412109, 0.2205000021218766], [0.3534999847412109, 0.21849999301028886], [0.3534999847412109, 0.21649999950073492], [0.3534999847412109, 0.2144999903891472], [0.3534999847412109, 0.21249999687959323], [0.3534999847412109, 0.2105000033700393], [0.3534999847412109, 0.2074999975036746], [0.3534999847412109, 0.20550000399412066], [0.3534999847412109, 0.20349999488253293], [0.3534999847412109, 0.20150000137297897], [0.3534999847412109, 0.19949999226139123], [0.3534999847412109, 0.1974999987518373], [0.3534999847412109, 0.19549998964024956], [0.3534999847412109, 0.19349999613069563], [0.3534999847412109, 0.19150000262114167], [0.3534999847412109, 0.18849999675477697], [0.3534999847412109, 0.18650000324522303], [0.3534999847412109, 0.1844999941336353], [0.3534999847412109, 0.18250000062408137], [0.3534999847412109, 0.1804999915124936], [0.3534999847412109, 0.17849999800293967], [0.3534999847412109, 0.17650000449338574], [0.3534999847412109, 0.174499995381798], [0.3534999847412109, 0.17250000187224407], [0.3534999847412109, 0.16949999600587934], [0.3534999847412109, 0.1675000024963254], [0.3534999847412109, 0.16549999338473767], [0.3534999847412109, 0.16349999987518374], [0.3534999847412109, 0.16149999076359597], [0.3534999847412109, 0.15949999725404204], [0.3534999847412109, 0.1575000037444881], [0.3534999847412109, 0.15549999463290037], [0.3534999847412109, 0.15250000436856948], [0.3534999847412109, 0.1504999952569817], [0.3534999847412109, 0.14850000174742778], [0.3534999847412109, 0.1904999902643309], [0.3534999847412109, 0.18849999675477697], [0.3514999866485596, 0.18650000324522303], [0.3534999847412109, 0.18650000324522303], [0.3514999866485596, 0.18650000324522303], [0.3514999866485596, 0.1844999941336353], [0.3514999866485596, 0.18250000062408137], [0.3514999866485596, 0.1804999915124936], [0.3514999866485596, 0.17849999800293967], [0.3514999866485596, 0.17650000449338574], [0.3514999866485596, 0.174499995381798], [0.3514999866485596, 0.17250000187224407], [0.3514999866485596, 0.16949999600587934], [0.3514999866485596, 0.1675000024963254], [0.3514999866485596, 0.16549999338473767], [0.3514999866485596, 0.16349999987518374], [0.3514999866485596, 0.16149999076359597], [0.3514999866485596, 0.15949999725404204], [0.3514999866485596, 0.1575000037444881], [0.3514999866485596, 0.15549999463290037], [0.3514999866485596, 0.15250000436856948], [0.3514999866485596, 0.14850000174742778], [0.3534999847412109, 0.14649999263584004], [0.3534999847412109, 0.1444999991262861], [0.3534999847412109, 0.14250000561673218], [0.3534999847412109, 0.14049999650514441], [0.3534999847412109, 0.13850000299559048], [0.3534999847412109, 0.13649999388400275], [0.3534999847412109, 0.13350000361967185], [0.3534999847412109, 0.1314999945080841], [0.3534999847412109, 
0.12950000099853015], [0.3534999847412109, 0.12749999968795933], [0.3534999847412109, 0.12549999837738848], [0.3534999847412109, 0.12349999706681765], [0.3534999847412109, 0.1214999957562468], [0.3534999847412109, 0.118499997690899], [0.3534999847412109, 0.11649999638032815], [0.3534999847412109, 0.11449999506975732], [0.3534999847412109, 0.11250000156020339], [0.3534999847412109, 0.11050000024963254], [0.3534999847412109, 0.1084999989390617], [0.3534999847412109, 0.10649999762849086], [0.3534999847412109, 0.10449999631792002], [0.3534999847412109, 0.10249999500734919], [0.3534999847412109, 0.10050000149779524], [0.3534999847412109, 0.09749999563143054], [0.3534999847412109, 0.09550000212187659], [0.3534999847412109, 0.09350000081130576], [0.3534999847412109, 0.09149999950073492], [0.3534999847412109, 0.08949999819016408], [0.3534999847412109, 0.08650000012481628], [0.3534999847412109, 0.08449999881424543], [0.3534999847412109, 0.0824999975036746], [0.3534999847412109, 0.08049999619310375], [0.3534999847412109, 0.07850000268354981], [0.3534999847412109, 0.07650000137297898], [0.3534999847412109, 0.07450000006240813], [0.3534999847412109, 0.0724999987518373], [0.3534999847412109, 0.07049999744126646], [0.3534999847412109, 0.06849999613069561], [0.3534999847412109, 0.06650000262114168], [0.3534999847412109, 0.06449999741006239], [0.3534999847412109, 0.0625], [0.3534999847412109, 0.06049999868942916], [0.3534999847412109, 0.05849999737885832], [0.3534999847412109, 0.056499999968795935], [0.3534999847412109, 0.05449999865822509], [0.3534999847412109, 0.05249999734765425], [0.3534999847412109, 0.05049999993759186], [0.3534999847412109, 0.04849999862702103], [0.3534999847412109, 0.04650000121695864], [0.3534999847412109, 0.0444999999063878], [0.3534999847412109, 0.042499998595816955], [0.3534999847412109, 0.03950000053046915], [0.3534999847412109, 0.03749999921989831], [0.3534999847412109, 0.03549999790932747], [0.3534999847412109, 0.03350000049926508], [0.3534999847412109, 0.03149999918869424], [0.3534999847412109, 0.02949999982837763], [0.3534999847412109, 0.027500000468061014], [0.3534999847412109, 0.025499999157490176], [0.3534999847412109, 0.02349999979717356], [0.3534999847412109, 0.021500000436856948], [0.3534999847412109, 0.019499999126286107], [0.3534999847412109, 0.01749999976596949], [0.3534999847412109, 0.015499999430525766], [0.3534999847412109, 0.012499999414923732], [0.3534999847412109, 0.010500000054607118], [0.3534999847412109, 0.008499999719163391], [0.3534999847412109, 0.006499999871283221], [0.3534999847412109, 0.004500000023403051], [0.3534999847412109, 0.002499999931741102], [0.3534999847412109, 0.0004999999924427649], [0.3534999847412109, 0.0004999999924427649], [0.3514999866485596, 0.0004999999924427649], [0.3534999847412109, 0.0015000000078010168], [0.3534999847412109, 0.0004999999924427649], [0.3554999828338623, 0.0004999999924427649], [0.3534999847412109, 0.9994999860205777], ], ], 'confidence': None, 'index': None, 'closed': True, 'filled': True, }>, ], }>
You can also use the Florence2 Plugin when you have existing captions on your dataset. We don't have any here, so let's generate captions and then use them for segmentation. Start by instantiating the Operator that will use the Florence2 model to generate captions:
import fiftyone.operators as foo
florence2_captioning = foo.get_operator("@jacobmarks/florence2/caption_with_florence2")
And calling the operator, as you've done previously:
await florence2_captioning(
dataset,
model_path=MODEL_PATH,
detail_level="basic",
output_field="basic_caption",
delegate=True
)
dataset.first()['basic_caption']
'A cat curled up in a bowl on a wooden floor.'
await florence2_referring_segmentation(
dataset,
model_path=MODEL_PATH,
expression_field="basic_caption", #must be a field on your dataset
output_field="expression_field_segmentations",
delegate=True
)
dataset.first()['expression_field_segmentations']
<Polylines: { 'polylines': [ <Polyline: { 'id': '67eac93f0c94125554ed2df0', 'attributes': {}, 'tags': [], 'label': 'object_1', 'points': [ [ [0.2244999885559082, 0.13549999712922578], [0.3494999885559082, 0.047499997971735604], [0.4485000133514404, 0.01749999976596949], [0.5275000095367431, 0.015499999430525766], [0.6174999713897705, 0.03649999856461289], [0.7184999942779541, 0.08350000205946846], [0.8164999961853028, 0.1704999927606563], [0.8845000267028809, 0.28050001023493415], [0.9204999923706054, 0.4265000044933857], [0.9234999656677246, 0.5385000154772175], [0.8984999656677246, 0.6594999660499744], [0.8545000076293945, 0.7514999795301317], [0.7565000057220459, 0.8724999925110237], [0.6065000057220459, 0.9414999870191079], [0.47049999237060547, 0.9484999955066142], [0.3524999856948853, 0.908500000499265], [0.3524999856948853, 0.908500000499265], [0.3494999885559082, 0.908500000499265], [0.2875, 0.9014999920117587], [0.281499981880188, 0.8614999970044095], [0.2244999885559082, 0.8054999915124936], [0.1375, 0.6684999680470347], [0.10749999284744263, 0.5405000089867714], [0.10449999570846558, 0.41949999600587934], [0.13450000286102295, 0.28050001023493415], ], ], 'confidence': None, 'index': None, 'closed': True, 'filled': True, }>, ], }>
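As a quick optional check (not part of the original cells), you can use the exists method of the Dataset to see how many samples actually received masks for each field:
print(len(dataset.exists("open_expression_segmentations")))
print(len(dataset.exists("expression_field_segmentations")))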
Moondream + SAM2¶
This next workflow will show you how to leverage the Moondream2 plugin for FiftyOne alongside SAM2 from the FiftyOne Model Zoo for zero-shot segmentation.
The process works by first using Moondream2 to automatically analyze your images and generate Keypoints of interest in a zero-shot fashion, requiring no training data or manual annotation. These points then serve as a prompt for SAM2, which uses them to generate segmentation masks around the detected objects or regions.
First, install the Moondream2 plugin and its requirements:
# download the plugin from the github repository
!fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin
# install requirements for the plugin
!fiftyone plugins requirements @harpreetsahota/moondream2 --install
With the plugin installed, you can instantiate the Operator like this:
import fiftyone.operators as foo
moondream_operator = foo.get_operator("@harpreetsahota/moondream2/moondream")
To use the operator, you will need to start a Delegated service by opening your terminal and running the following command:
fiftyone delegated launch
Then, you can call the Operator by running the following cell:
await moondream_operator(
dataset,
revision="2025-01-09",
operation="point",
output_field="moondream_point",
delegate=True,
object_type="person"
)
You'll notice a progress bar in the terminal where you launched the delegated service; you can monitor it to see when the operation completes.
Use the reload method of the Dataset to reload any in-memory samples from the database:
dataset.reload()
This should create a point on top of each person detected in an image. In this Dataset, the first Sample didn't contain any instances of person. To demonstrate what a Keypoint from Moondream looks like, we can grab a later Sample by combining the skip and first methods of the Dataset. Notice that, because there can be multiple keypoints in an image, the individual Keypoint labels are wrapped in a Keypoints label:
dataset.skip(15).first()['moondream_point']
<Keypoints: { 'keypoints': [ <Keypoint: { 'id': '67eac9a60c94125554ed2e29', 'attributes': {}, 'tags': [], 'label': 'person', 'points': [[0.59765625, 0.5869140625]], 'confidence': None, 'index': None, }>, <Keypoint: { 'id': '67eac9a60c94125554ed2e2a', 'attributes': {}, 'tags': [], 'label': 'person', 'points': [[0.2314453125, 0.6767578125]], 'confidence': None, 'index': None, }>, ], }>
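As another optional sanity check (again, not in the original cells), you can count the total number of Moondream keypoints detected across the dataset:
dataset.count("moondream_point.keypoints")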
To use SAM2 from the FiftyOne Model Zoo you need to first install its dependencies. You can do so by running the following command:
pip install "git+https://github.com/facebookresearch/sam2.git#egg=sam-2"
Please refer to the SAM2 GitHub repo for details and troubleshooting, and to the Model Zoo documentation for more information about which checkpoints are available in the FiftyOne Model Zoo.
sam2_model = foz.load_zoo_model("segment-anything-2.1-hiera-base-plus-image-torch")
dataset.apply_model(
sam2_model,
label_field="sam_segmentations",
prompt_field="moondream_point",
)
dataset.skip(15).first()['sam_segmentations']
<Detections: { 'detections': [ <Detection: { 'id': '67eacaa521dbe34b2775a1ec', 'attributes': {}, 'tags': [], 'label': 'person', 'bounding_box': [ 0.578125, 0.5247058823529411, 0.0421875, 0.1811764705882353, ], 'mask': array([[False, False, False, ..., False, False, False], [False, False, False, ..., False, False, False], [False, False, False, ..., False, False, False], ..., [False, False, False, ..., True, True, True], [False, False, False, ..., True, True, True], [False, False, False, ..., False, False, False]]), 'mask_path': None, 'confidence': 0.8452418446540833, 'index': None, }>, <Detection: { 'id': '67eacaa521dbe34b2775a1ed', 'attributes': {}, 'tags': [], 'label': 'person', 'bounding_box': [ 0.2109375, 0.6470588235294118, 0.0390625, 0.05411764705882353, ], 'mask': array([[False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, False, False, False, False, False, False], [False, False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False], [False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False], [False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False], [False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False], [False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False], [False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False], [False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False], [False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, False, True, True, True, True, 
True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False], [False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False], [ True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True], [ True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True], [False, True, True, True, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]]), 'mask_path': None, 'confidence': 0.8010133504867554, 'index': None, }>, ], }>
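Because SAM2 attaches a confidence score to each mask, one optional follow-up (not part of the original workflow; the 0.8 threshold is arbitrary) is to build a view containing only high-confidence masks:
from fiftyone import ViewField as F

# Keep only masks whose confidence exceeds the threshold
high_conf_view = dataset.filter_labels("sam_segmentations", F("confidence") > 0.8)
print(high_conf_view.count("sam_segmentations.detections"))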
Prompting with Keypoints¶
When passing FiftyOne keypoints to external segmentation models, you need to perform some conversion.
FiftyOne's Keypoint class stores points in normalized coordinates within the [0,1] x [0,1] range, regardless of image dimensions. This normalization enables consistent representation across images of different sizes.
When a model requires absolute pixel coordinates to generate segmentation masks, you'll need to convert between these systems: feeding FiftyOne keypoints to such a model means scaling the normalized coordinates back to pixel values using the image's actual dimensions from its metadata.
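Here is a minimal sketch of that conversion, using the field names from this notebook; the int rounding matches what the FastSAM loop below does:
sample = dataset.skip(15).first()
width, height = sample.metadata.width, sample.metadata.height

# Scale each normalized (x, y) pair by the image dimensions
pixel_points = [
    [int(x * width), int(y * height)]
    for kp in sample.moondream_point.keypoints
    for x, y in kp.points
]
print(pixel_points)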
How to parse segmentation masks that are returned as point coordinates¶
FastSAM from Ultralytics¶
FastSAM outputs segmentation masks as normalized coordinate arrays. Each mask is represented as an array of (x,y) coordinates defining the boundary of a detected object. These coordinates are normalized to [0,1] range and stored in NumPy arrays.
When working with segmentation models that output point coordinates (boundary points of objects), here's what you need to know to display them in FiftyOne; the full loop below puts all of these steps together:
Model outputs (often NumPy arrays or tensors) need to be converted to standard Python lists of coordinates.
Ensure coordinates are normalized to [0,1] range if they aren't already.
FiftyOne's Polyline expects a specific nesting structure - your points must be organized as a list of shapes, where each shape is a list of points.
Specify that your polylines should be closed (connecting last point to first) and filled to properly represent segmentation masks.
Store your polylines in a Polylines field to enable proper visualization in the FiftyOne UI.
Note: FastSAM can accept text input and bounding boxes as prompts. Refer to the Ultralytics documentation to learn more.
import torch
from ultralytics import FastSAM
import fiftyone as fo
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Load FastSAM model
model = FastSAM("FastSAM-s.pt")
# Process samples in dataset
for sample in dataset.iter_samples(progress=True):
    # Skip samples without keypoints
    if not hasattr(sample, "moondream_point") or not sample.moondream_point.keypoints:
        continue

    # Get image path and dimensions
    image_path = sample.filepath
    image_width = sample.metadata.width
    image_height = sample.metadata.height

    # Process all keypoints in the sample
    all_keypoints = sample.moondream_point.keypoints

    # Collect all points from all keypoint objects
    all_pixel_points = []
    all_labels = []
    for keypoint_obj in all_keypoints:
        points = keypoint_obj.points

        # Convert to pixel coordinates
        pixel_points = [
            [int(point[0] * image_width), int(point[1] * image_height)]
            for point in points
        ]
        all_pixel_points.extend(pixel_points)
        all_labels.extend([1] * len(pixel_points))  # 1 for foreground

    # Run inference with all points
    results = model(
        image_path,
        device=device,
        retina_masks=True,
        points=all_pixel_points,
        labels=all_labels,
        conf=0.51,
        iou=0.51,
    )
    result = results[0]

    # Check if masks were generated
    if hasattr(result, "masks") and result.masks is not None:
        masks = result.masks.xyn

        # Create polyline objects for each mask
        polylines = []
        for mask in masks:
            # Convert NumPy arrays to plain Python lists of floats
            points_list = [[float(point[0]), float(point[1])] for point in mask]

            # Create polyline with correct nesting
            polyline = fo.Polyline(
                points=[points_list],  # Each mask is one shape
                closed=True,
                filled=True,
            )
            polylines.append(polyline)

        # Save to sample if we have valid polylines
        if polylines:
            polylines_field = fo.Polylines(polylines=polylines)
            sample["fastsam_segmentation"] = polylines_field
            sample.save()
Note that we are calling sample.save() after adding predictions to each Sample. This method persists your changes to the FiftyOne database, ensuring that your generated segmentation masks are stored and accessible for future use.
dataset.reload()
dataset
dataset.first()['fastsam_segmentation']
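Finally, you can visualize all of the generated label fields side by side in the FiftyOne App; this standard call isn't shown in the original cells:
session = fo.launch_app(dataset)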
Summary¶
In this tutorial, you've learned how to:
Implement zero-shot segmentation using different approaches without training custom models
Use text prompts with Florence2 for phrase grounding segmentation
Combine Moondream2's automatic keypoint detection with SAM2 for efficient segmentation
Leverage FastSAM for point-based segmentation tasks
Key Takeaways¶
Zero-shot segmentation enables quick implementation of segmentation tasks without labeled training data
Different prompting methods (text, points, hybrid) offer flexibility for various use cases
FiftyOne plugins can significantly extend your computer vision capabilities
Combining multiple models (like Moondream2 + SAM2) can create powerful workflows
Next Steps¶
Check out some more FiftyOne Plugins
Check out SAM2 in the FiftyOne Model Zoo
Check out MedSAM2 in the FiftyOne Model Zoo
Learn how to evaluate segmentations with the FiftyOne Evaluation API