Use Gemini 2.5 for Zero-Shot Object Detection & Segmentation
Detecting and segmenting objects in an image has traditionally required fine-tuning a model to perform well on a specific use case. This process is time-consuming, resource-intensive, and often inflexible when working with new or uncommon object classes. For some use cases, however, multimodal models can increasingly help.
Gemini 2.5, a multimodal language model developed by Google, supports zero-shot object detection and segmentation. You can send an image to Gemini 2.5 and ask for bounding boxes or segmentation masks corresponding to objects of interest. This model performs best on common objects.
In this guide, we will show how to use Gemini 2.5 for zero-shot object detection and segmentation.
Let's get started!
Environment Setup
In this guide, we'll build a program that uses Gemini 2.5 as a vision language model (VLM), asking it to produce bounding boxes and segmentation masks for an image. If you refer to the Colab notebook, you'll find a link to the Google API Console. There, create a new project and generate a new API key.
Next, we'll install the dependencies required and store our API key in an environment variable.
Before proceeding, we need to enable the Generative Language API in our project on Google API Console.
From here, create a file called .env on your computer. This is where we'll store our key. You can find the key by clicking the "Show key" button in the Credentials tab of your project.
You can then store this key in a variable in .env by adding the following line in the file.
GOOGLE_API_KEY="<<YOUR API KEY>>"
To install the Gemini API client and the visualization library, open a new terminal and execute:
pip install google-genai supervision==0.26.0rc8
This will install both google-genai and supervision, the packages we'll need for running the model and overlaying annotations.
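If you'd like to confirm the installation before moving on, you can optionally import both packages and print the Supervision version. This check isn't required for the rest of the guide:

```python
# Optional sanity check: both packages should import without errors.
from google import genai
import supervision as sv

print("google-genai and supervision imported successfully")
print("supervision version:", sv.__version__)  # expect 0.26.0rc8 or newer
```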
Next, we'll need an image to run the model on. The Colab notebook provides some sample images via two commands; if you'd like to use different images, you can follow this guide on uploading your own files to Google Colab.
If you're running the project locally, create a new project in a code editor and add some sample images to the same directory. In my local project, I used a stock image of motorcyclists from Pexels, with the goal of detecting their helmets and drawing masks over the riders and motorcycles.
Now, we're ready to implement zero-shot detection and segmentation!
Gemini 2.5 Object Detection Implementation
The first step in creating the application is setting up the Gemini client, either by executing the snippet in the notebook or by adding the following code to a new file, main.py, in your local project:
import os
from dotenv import load_dotenv
from google import genai
from google.genai import types
load_dotenv()
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]
MODEL_NAME = "gemini-2.5-flash-preview-05-20"
TEMPERATURE = 0.5
IMAGE_PATH = "sample.jpg"
PROMPT = "Detect the helmets. " + \
"Output a JSON list of bounding boxes where each entry contains the 2D bounding box in the key \"box_2d\", " + \
"and the text label in the key \"label\". Use descriptive labels."\
This code mirrors the notebook, except that here we're using the dotenv library (installable locally with pip install python-dotenv) to access the environment variable. Additionally, the first part of the prompt will vary depending on what you aim to detect.
This example detects the motorcyclists' helmets, but you'll need to change the prompt to match whatever image you've provided. Don't forget to update the image path in both the notebook and your local project to point to your images.
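If you find yourself swapping the target often, one small convenience (purely illustrative, not part of the notebook) is to build the prompt from a variable that names what you want detected:

```python
# Hypothetical helper: build the detection prompt from a target description.
# Change TARGET to match whatever appears in your own image.
TARGET = "helmets"
PROMPT = (
    f"Detect the {TARGET}. "
    "Output a JSON list of bounding boxes where each entry contains the 2D bounding box "
    'in the key "box_2d", and the text label in the key "label". Use descriptive labels.'
)
```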
Now we'll prepare the image we plan to use, and then we can run the model. The notebook makes this straightforward: a single snippet prepares the image for all later executions. Locally, you'll need to install Pillow, if you haven't already, by executing:
pip install Pillow
From here, import this library at the top of your script and add the snippet code to the file:
import os
from dotenv import load_dotenv
from google import genai
from google.genai import types
from PIL import Image
load_dotenv()
# Load Gemini 2.5 and prompt
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]
MODEL_NAME = "gemini-2.5-flash-preview-05-20"
TEMPERATURE = 0.5
IMAGE_PATH = "sample.jpg"
PROMPT = "Detect the motorcycles." + \
"Output a JSON list of bounding boxes where each entry contains the 2D bounding box in the key \"box_2d\", " + \
"and the text label in the key \"label\". Use descriptive labels."
# Image and response
image = Image.open(IMAGE_PATH)
width, height = image.size
target_height = int(1024 * height / width)
resized_image = image.resize((1024, target_height), Image.Resampling.LANCZOS)
response = client.models.generate_content(
    model=MODEL_NAME,
    contents=[PROMPT, resized_image],
    config=types.GenerateContentConfig(
        temperature=TEMPERATURE,
        safety_settings=safety_settings,
        thinking_config=types.ThinkingConfig(
            thinking_budget=0
        )
    )
)
print(response.text)
The snippet we just added gets a response by sending the prompt and the resized image through the Gemini client we initialized with genai. Running it should print a response containing the bounding box coordinates and labels in JSON format, as instructed by the prompt.
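For reference, the raw text the model returns is a JSON array (occasionally wrapped in a markdown code fence). The boxes and labels below are made-up values for illustration; Gemini's box_2d coordinates are [y_min, x_min, y_max, x_max] normalized to a 0-1000 range, which is why the code later passes the image resolution to Supervision when parsing the response:

```python
import json

# Illustrative only: actual boxes and labels depend on your image and prompt.
# box_2d values are [y_min, x_min, y_max, x_max], normalized to 0-1000.
sample_response = """
[
  {"box_2d": [412, 120, 655, 310], "label": "red full-face helmet"},
  {"box_2d": [398, 540, 630, 720], "label": "black open-face helmet"}
]
"""
for detection in json.loads(sample_response):
    print(detection["label"], detection["box_2d"])
```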
The last step for object detection is displaying the bounding boxes we got from Gemini on the input image.
Supervision, a library of computer vision tools, lets us do this easily. The notebook snippet imports the library and formats the bounding boxes and labels for display. Merging it into main.py, our code is:
```python
import os
from dotenv import load_dotenv
from google import genai
from google.genai import types
from PIL import Image
import supervision as sv
load_dotenv()
# Load Gemini 2.5 and prompt
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]
MODEL_NAME = "gemini-2.5-flash-preview-05-20"
TEMPERATURE = 0.5
IMAGE_PATH = "sample.jpg"
PROMPT = "Detect the motorcycles." + \
"Output a JSON list of bounding boxes where each entry contains the 2D bounding box in the key \"box_2d\", " + \
"and the text label in the key \"label\". Use descriptive labels."
# Image and response
image = Image.open(IMAGE_PATH)
width, height = image.size
target_height = int(1024 * height / width)
resized_image = image.resize((1024, target_height), Image.Resampling.LANCZOS)
response = client.models.generate_content(
    model=MODEL_NAME,
    contents=[PROMPT, resized_image],
    config=types.GenerateContentConfig(
        temperature=TEMPERATURE,
        safety_settings=safety_settings,
        thinking_config=types.ThinkingConfig(
            thinking_budget=0
        )
    )
)
# Overlay image
resolution_wh = image.size
detections = sv.Detections.from_vlm(
    vlm=sv.VLM.GOOGLE_GEMINI_2_5,
    result=response.text,
    resolution_wh=resolution_wh
)
thickness = sv.calculate_optimal_line_thickness(resolution_wh=resolution_wh)
text_scale = sv.calculate_optimal_text_scale(resolution_wh=resolution_wh)
box_annotator = sv.BoxAnnotator(thickness=thickness)
label_annotator = sv.LabelAnnotator(
    smart_position=True,
    text_color=sv.Color.BLACK,
    text_scale=text_scale,
    text_position=sv.Position.CENTER
)
annotated = image
for annotator in (box_annotator, label_annotator):
    annotated = annotator.annotate(scene=annotated, detections=detections)
sv.plot_image(annotated)
```
Here, both the box annotator and label annotator tools are used. You can check out the detailed documentation for all the tools offered to fully control your visuals.
With this, running main.py (or executing the visualization snippet in the notebook) should show the detections on the input image. You can now change the prompt to detect different objects, as well as swap in a different sample image.
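If you'd rather save the result to disk than display it (for example, on a headless machine), the annotated image should still be a PIL image at this point, since recent Supervision releases preserve the input type, so you can write it out directly. The output filename below is just an example:

```python
# Save the annotated image to disk instead of (or in addition to) plotting it.
annotated.save("annotated_detections.jpg")
```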
Gemini 2.5 Instance Segmentation Implementation
Because these tasks are all zero-shot, you can produce different kinds of predictions just by changing the prompt and the visualization code. The notebook snippets follow the same steps we took for object detection, except that the second part of the prompt asks for segmentation masks in JSON rather than bounding boxes. Here is what main.py looks like, incorporating the segmentation snippets:
import os
from dotenv import load_dotenv
from google import genai
from google.genai import types
from PIL import Image
import supervision as sv
load_dotenv()
# Load Gemini 2.5 and prompt
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]
MODEL_NAME = "gemini-2.5-flash-preview-05-20"
TEMPERATURE = 0.5
IMAGE_PATH = "sample.jpg"
# Object detection prompt
# PROMPT = "Detect the helmets. " + \
#     "Output a JSON list of bounding boxes where each entry contains the 2D bounding box in the key \"box_2d\", " + \
#     "and the text label in the key \"label\". Use descriptive labels."
# Instance segmentation prompt
PROMPT = "Give the segmentation masks for the motorcycles. " + \
    "Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", " + \
    "the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
# Image and response
image = Image.open(IMAGE_PATH)
width, height = image.size
target_height = int(1024 * height / width)
resized_image = image.resize((1024, target_height), Image.Resampling.LANCZOS)
response = client.models.generate_content(
    model=MODEL_NAME,
    contents=[PROMPT, resized_image],
    config=types.GenerateContentConfig(
        temperature=TEMPERATURE,
        safety_settings=safety_settings,
        thinking_config=types.ThinkingConfig(
            thinking_budget=0
        )
    )
)
# Overlay image
resolution_wh = image.size
detections = sv.Detections.from_vlm(
    vlm=sv.VLM.GOOGLE_GEMINI_2_5,
    result=response.text,
    resolution_wh=resolution_wh
)
thickness = sv.calculate_optimal_line_thickness(resolution_wh=resolution_wh)
text_scale = sv.calculate_optimal_text_scale(resolution_wh=resolution_wh) / 3  # smaller labels for the mask overlay
box_annotator = sv.BoxAnnotator(thickness=thickness)
label_annotator = sv.LabelAnnotator(
    smart_position=True,
    text_color=sv.Color.BLACK,
    text_scale=text_scale,
    text_position=sv.Position.CENTER
)
# Object detection annotations
# annotated = image
# for annotator in (box_annotator, label_annotator):
#     annotated = annotator.annotate(scene=annotated, detections=detections)
# sv.plot_image(annotated)
# Instance segmentation annotations
masks_annotator = sv.MaskAnnotator()
annotated = image
for annotator in (box_annotator, label_annotator, masks_annotator):
    annotated = annotator.annotate(scene=annotated, detections=detections)
sv.plot_image(annotated)
In the snippet we added, and in the notebook, the visualization changes to show masks by creating a mask annotator from Supervision and adding it to the annotation loop.
Running the visualization snippet (or the program locally) should display the input image with boxes, labels, and masks overlaid.
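Beyond visualization, the parsed masks are also available programmatically: in Supervision, detections.mask is a boolean array of shape (number of detections, height, width) when masks were returned. A minimal sketch that reports the pixel area of each mask:

```python
# Inspect the parsed masks (only present if the model returned them and parsing succeeded).
if detections.mask is not None:
    for i, mask in enumerate(detections.mask):
        print(f"mask {i}: {int(mask.sum())} pixels")
else:
    print("No masks were parsed from the response.")
```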
And with this, we've successfully used Gemini 2.5 for zero-shot detection and segmentation.
Conclusion
Congratulations! You have successfully used Gemini 2.5 and Supervision to perform zero-shot object detection and instance segmentation—all without writing a single training loop or collecting labeled data. By simply changing the prompt and re-running the script, you can detect and segment entirely new objects, making this approach incredibly flexible and efficient.