Zero-Shot Image Annotation with Grounding DINO and SAM - A Notebook Tutorial

Annotating images for object detection or instance segmentation is tedious and expensive. Grounding DINO and the Segment Anything Model (SAM) are two state-of-the-art models that can considerably speed up this process. In this post, we show how to use them to annotate images and to convert object detection datasets into instance segmentation datasets.

We have prepared a Jupyter Notebook that you can open locally or in Google Colab and follow along while reading the blog post.

Let’s dive into the details!

💡
You can train small models using Grounding DINO in a few lines of code and without labeling data using Autodistill, an open-source ecosystem for distilling large foundation models into smaller models trained on your data.

Check out our Autodistill guide for more information, and our Autodistill Grounding DINO documentation.

Annotating Images with Grounding DINO and SAM

Grounding DINO and SAM can both assist in the dataset annotation process. Grounding DINO performs zero-shot detection: given a text prompt naming the objects you want, it returns bounding boxes for them. SAM can then convert these bounding boxes into instance segmentation masks.

We have previously published blog posts that cover the details of these models and provide step-by-step instructions on how to install and use them. To save time, we encourage you to refer to these blog posts for in-depth information and guidance: Grounding DINO: Zero-Shot Object Detection and How to Use Segment Anything Model (SAM).

Using Grounding DINO to Generate Bounding Boxes

First, prepare the image you want to annotate. Next, use the Grounding DINO model to generate bounding boxes around the objects present in the image. These initial bounding boxes will act as a starting point for the instance segmentation process.

Source image on the left and Grounding DINO inference results on the right.
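If you call Grounding DINO through its lower-level inference helpers rather than a wrapper, the boxes typically come back as normalized (cx, cy, w, h) values, whereas SAM expects box prompts as pixel (x_min, y_min, x_max, y_max). The sketch below shows that conversion in plain Python. The function name `cxcywh_norm_to_xyxy` is ours and is not part of either library, and clamping to the image bounds is an assumption about how you want to treat boxes that spill over the edge.

```python
def cxcywh_norm_to_xyxy(box, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max).

    Hypothetical helper for illustration; wrappers around Grounding DINO
    may already perform this conversion for you.
    """
    cx, cy, w, h = box
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    # Assumption: clamp coordinates so the box never leaves the image.
    return (
        max(0.0, x_min),
        max(0.0, y_min),
        min(float(image_width), x_max),
        min(float(image_height), y_max),
    )
```

For a 640×480 image, a box centered in the frame that spans half the width and half the height maps to (160.0, 120.0, 480.0, 360.0).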

Using SAM to Convert Bounding Boxes into Instance Segmentation Masks

With the bounding boxes in place, you can now use the SAM model to convert them into instance segmentation masks. The SAM model takes the bounding box data as input and outputs precise segmentation masks for each object in the image.

SAM inference input prompts on the left and SAM inference results on the right.
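One way to wire the two steps together is sketched below. It assumes a predictor object that offers the `set_image` and `predict(box=..., multimask_output=True)` methods of `SamPredictor` from the `segment_anything` package; `boxes_to_masks` is a hypothetical helper name. SAM can return several candidate masks per prompt, each with a confidence score, and this sketch keeps the highest-scoring one, which is a design choice rather than a requirement.

```python
def boxes_to_masks(predictor, image, boxes_xyxy):
    """Return one mask per box, keeping the highest-scoring candidate.

    Assumes `predictor` behaves like segment_anything.SamPredictor and that
    each box is in pixel (x_min, y_min, x_max, y_max) order. With the real
    predictor, each box should be a NumPy array of length 4.
    """
    # The image embedding is computed once and reused for every box prompt.
    predictor.set_image(image)
    best_masks = []
    for box in boxes_xyxy:
        masks, scores, _ = predictor.predict(box=box, multimask_output=True)
        # Pick the index of the candidate mask with the highest score.
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        best_masks.append(masks[best_index])
    return best_masks
```

Because the function only relies on those two methods, you can exercise it with a small stand-in object before loading the full model.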

Dataset Processing and Saving

After obtaining the initial instance segmentation masks, it’s essential to clean up the data. This may involve removing duplicate detections, merging overlapping polygons, or splitting polygons that cover multiple objects. Careful data cleaning ensures the final annotations are accurate and ready for various purposes.

Raw masks (before conversion to Pascal VOC XML) visualized in separate images.
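As an illustration of the duplicate-removal step, here is a minimal sketch of greedy non-maximum suppression on bounding boxes. This is one common approach and not necessarily what the notebook does; the function names and the 0.8 IoU threshold are assumptions chosen for the example, and the comparison ignores class labels.

```python
def box_iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def drop_duplicates(boxes, scores, iou_threshold=0.8):
    """Return the indices of boxes to keep, in their original order.

    Boxes are visited from highest to lowest score; a box is dropped when it
    overlaps an already kept box by at least `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return sorted(kept)
```

A lower threshold removes more near-duplicates but risks merging objects that genuinely sit close together, so it is worth checking the result on a few images.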

We save the generated masks for two reasons:

  • To refine the annotations in an annotation tool like Roboflow Annotate, which allows for easy editing and validation of the data.
  • To use the annotations for training other models, such as real-time object detectors like YOLOv8, by providing them with a solid foundation of annotated data.

For this tutorial, we will save the detections in the Pascal VOC XML format, which is compatible with many annotation tools and machine learning frameworks.

When saving the annotations, you can adjust the following parameters to fine-tune the results:

  • min_image_area_percentage: This parameter defines the minimum area percentage of a mask in relation to the total image area. Masks with an area smaller than this percentage will be discarded. This helps to filter out small, potentially noisy detections.
  • max_image_area_percentage: Similarly, this parameter defines the maximum area percentage of a mask in relation to the total image area. Masks with an area larger than this percentage will be discarded. This helps to filter out large, potentially erroneous detections.
  • approximation_percentage: This parameter controls the simplification of the mask polygons. A higher percentage results in fewer points in the polygon and a more simplified shape, while a lower percentage preserves more details of the original mask. Adjusting this value can help balance the trade-off between the accuracy and complexity of the mask shapes.

Here’s an example of how to save the annotations using the supervision package:

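import supervision as sv

# Placeholder: build or load your dataset here.
# Note: newer supervision releases appear to name this class sv.DetectionDataset.
dataset = sv.Dataset(...)

# The area thresholds are fractions of the total image area:
# 0.002 keeps masks covering at least 0.2% of the image, 0.80 at most 80%.
dataset.as_pascal_voc(
    annotations_directory_path='path/to/annotations/directory',
    min_image_area_percentage=0.002,
    max_image_area_percentage=0.80,
    approximation_percentage=0.75
)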

By adjusting these parameters, you can control the quality of the automated annotations and fine-tune the results to meet your specific needs.
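To make the two area thresholds concrete, the sketch below reimplements the filtering idea in plain Python. It is not the supervision implementation: the function names are hypothetical, masks are represented as lists of rows of booleans, and treating both limits as inclusive is an assumption. It does not cover polygon simplification.

```python
def mask_area_fraction(mask):
    """Fraction of image pixels covered by a binary mask (a list of rows)."""
    total = sum(len(row) for row in mask)
    covered = sum(1 for row in mask for value in row if value)
    return covered / total if total else 0.0


def filter_masks_by_area(masks, min_fraction=0.002, max_fraction=0.80):
    """Keep masks whose covered area lies within [min_fraction, max_fraction].

    The defaults mirror the values used in the example above; the limits are
    fractions of the image area, so 0.002 means 0.2% and 0.80 means 80%.
    """
    return [
        mask
        for mask in masks
        if min_fraction <= mask_area_fraction(mask) <= max_fraction
    ]
```

With these defaults, an empty mask and a mask covering 90% of the image are both discarded, while a mask covering a quarter of the image is kept.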

Editing Annotations in Roboflow Annotate

After the automated labeling process is complete, you’ll want to clean up the data to ensure the annotations meet your needs. Roboflow Annotate provides a web-based platform for this task, with one-click polygon labeling powered by SAM.

Automatically generated annotations after being loaded into Roboflow UI.

Converting Object Detection Datasets into Instance Segmentation Datasets

If you already have a dataset annotated with bounding boxes and want to convert it into an instance segmentation dataset, you can use SAM on its own, because the model accepts existing bounding boxes as prompts.

First, download your object detection dataset into your Python environment. If you store your dataset in Roboflow, you can export it manually or use the roboflow package to download it. Then, loop over the bounding boxes for each image, querying SAM for each one separately to generate an instance segmentation mask. Finally, save the resulting masks and annotations.

Source image with bounding box annotations on the left and the same image with masks generated by SAM on the right.
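The first part of that loop is reading the existing boxes. As a minimal sketch, assuming your detection dataset is exported in Pascal VOC XML (other formats need a different reader), the boxes for one image can be pulled out with the standard library. `read_voc_boxes` and the sample annotation are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up annotation used only to illustrate the expected structure.
EXAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>
"""


def read_voc_boxes(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # skip objects that carry no bounding box
        coords = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), coords))
    return boxes


boxes = read_voc_boxes(EXAMPLE_XML)
```

Each tuple of coordinates is already in the pixel (x_min, y_min, x_max, y_max) order that SAM expects for box prompts, so it can be passed to the model one box at a time.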

Conclusions

The quality of the labels obtained with Grounding DINO and SAM largely depends on the classes you want to annotate. You can expect excellent results for common classes that appear in popular datasets. For obscure classes, however, annotation quality may be poor. To assess how well these models work for your specific use case, we recommend starting with a small, representative set of images from your dataset.

It’s important to note that, despite the latest breakthroughs in the computer vision space, we are still not ready to automate the annotation process fully. Human validation and curation are essential to ensure optimal outcomes. Nonetheless, leveraging models like Grounding DINO and SAM can result in significant time savings when annotating an entire dataset, allowing you to focus on refining the results and training more accurate machine learning models.

Exciting Announcement 🔥

We’re thrilled to announce that we are developing a Python library to streamline the process of transferring knowledge from powerful zero-shot computer vision models like Grounding DINO, SAM, CLIP, and others to real-time detectors like YOLOv8. This innovation will revolutionize dataset annotation and accelerate CV projects. Stay tuned by subscribing to our weekly newsletter for the upcoming release, and don’t miss out on this exciting development in computer vision!