Discover the potential of Meta AI’s Segment Anything Model (SAM) in this comprehensive tutorial. We dive into SAM, an efficient and promptable model for image segmentation. Trained on over 1 billion masks across 11M licensed and privacy-respecting images, SAM delivers zero-shot performance that is often competitive with or even superior to prior fully supervised results. For more information on how SAM works and the model architecture, read our SAM technical deep dive.

In this written tutorial (and the video below), we will explore how to use SAM to generate masks automatically, create segmentation masks using bounding boxes, and convert object detection datasets into segmentation masks. If you're interested in using SAM to label data for computer vision, Roboflow Annotate uses SAM to power automated polygon labeling in the browser, which you can try for free.

In object detection, objects are often represented by bounding boxes, which are like drawing a rectangle around the object. These rectangles give a general idea of the object's location, but they don't show the exact shape of the object. They may also include parts of the background or other objects inside the rectangle, making it difficult to separate objects from their surroundings.

Segmentation masks, on the other hand, are like drawing a detailed outline around the object, following its exact shape. This allows for a more precise understanding of the object's shape, size, and position.

Figure showing the difference between detection by bounding box (left) and segmentation (right). Note how a large part of the bounding box area does not belong to the detected object.
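
To put a rough number on that difference, here is a minimal sketch that compares the area of a mask with the area of its tight bounding box. The circular mask is synthetic and purely illustrative; it is not taken from SAM output.

```python
import numpy as np

# Synthetic example: a filled circle of radius 40 on a 100x100 canvas.
height, width = 100, 100
yy, xx = np.mgrid[:height, :width]
mask = (xx - 50) ** 2 + (yy - 50) ** 2 <= 40 ** 2  # boolean mask, shape (H, W)

# Tight bounding box around the mask: [x_min, y_min, x_max, y_max].
ys, xs = np.nonzero(mask)
x_min, x_max = xs.min(), xs.max()
y_min, y_max = ys.min(), ys.max()

box_area = (x_max - x_min + 1) * (y_max - y_min + 1)
mask_area = int(mask.sum())

# Fraction of the box that is background rather than object.
background_fraction = 1 - mask_area / box_area
print(f"box: {box_area} px, mask: {mask_area} px, "
      f"background inside box: {background_fraction:.1%}")
```

Even for a compact shape like a circle, roughly a fifth of the box is background; elongated or diagonal objects usually fare worse.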

To use Segment Anything on a local machine, we'll follow these steps:

  1. Set up a Python environment
  2. Load the Segment Anything Model (SAM)
  3. Generate masks automatically with SAM
  4. Plot masks onto an image with Supervision
  5. Generate bounding boxes from the SAM results

💡 In July 2023, a separate team of researchers released FastSAM, which was trained on 2% of the Segment Anything SA-1B dataset. FastSAM is less accurate than SAM but considerably faster. Read our analysis of the FastSAM model.

Setting up Your Python Environment

To get started, open the Roboflow notebook in Google Colab and ensure you have access to a GPU for faster processing. Next, install the required project dependencies and download the necessary files, including the SAM weights.

pip install \
'git+https://github.com/facebookresearch/segment-anything.git'
pip install -q roboflow supervision
wget -q \
'https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth'

Loading the Segment Anything Model

Once your environment is set up, load the SAM model into memory. SAM offers multiple inference modes, so you can generate masks in several ways. We will explore automated mask generation, generating segmentation masks with bounding boxes, and converting object detection datasets into segmentation masks.

The SAM model can be loaded with 3 different encoders: ViT-B, ViT-L, and ViT-H. ViT-H improves substantially over ViT-B but has only marginal gains over ViT-L. These encoders have different parameter counts, with ViT-B having 91M, ViT-L having 308M, and ViT-H having 636M parameters. This difference in size also influences the speed of inference, so keep that in mind when choosing the encoder for your specific use case.

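import torch
from segment_anything import sam_model_registry

# Use the GPU when available; fall back to the CPU otherwise.
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
MODEL_TYPE = "vit_h"
# ViT-H weights downloaded with wget above (assumes the current working directory).
CHECKPOINT_PATH = "sam_vit_h_4b8939.pth"

# Build the model with the selected encoder and move it to the device.
sam = sam_model_registry[MODEL_TYPE](checkpoint=CHECKPOINT_PATH)
sam.to(device=DEVICE)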
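
If you want to switch encoders, the checkpoint file has to match the model type. The sketch below pairs each registry key with its checkpoint file name as published in the segment-anything repository. The `SAM_CHECKPOINTS` table and the `checkpoint_url` helper are hypothetical names introduced here for illustration; they are not part of the library.

```python
# Registry keys mapped to the checkpoint file names published in the
# segment-anything repository. The dict and helper are illustrative only.
SAM_CHECKPOINTS = {
    "vit_b": "sam_vit_b_01ec64.pth",  # smallest encoder, fastest inference
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_h": "sam_vit_h_4b8939.pth",  # largest encoder, used in this tutorial
}
BASE_URL = "https://dl.fbaipublicfiles.com/segment_anything/"


def checkpoint_url(model_type: str) -> str:
    """Return the download URL for the given SAM model type."""
    if model_type not in SAM_CHECKPOINTS:
        raise ValueError(
            f"Unknown model type {model_type!r}; "
            f"expected one of {sorted(SAM_CHECKPOINTS)}"
        )
    return BASE_URL + SAM_CHECKPOINTS[model_type]


print(checkpoint_url("vit_b"))
```

After downloading the matching file, set `MODEL_TYPE` and `CHECKPOINT_PATH` to the same pair before building the model.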

Automated Mask (Instance Segmentation) Generation with SAM

To generate masks automatically, use the SamAutomaticMaskGenerator. This utility generates a list of dictionaries describing individual segmentations. Each dict in the result list has the following format:

  • segmentation - [np.ndarray] - the mask with (H, W) shape and bool type, where H and W are the height and width of the original image, respectively
  • area - [int] - the area of the mask in pixels
  • bbox - [List[int]] - the bounding box of the mask in xywh format
  • predicted_iou - [float] - the model's own prediction for the quality of the mask
  • point_coords - [List[List[float]]] - the sampled input point that generated this mask
  • stability_score - [float] - an additional measure of mask quality
  • crop_box - [List[int]] - the crop of the image used to generate this mask in xywh format
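
Because each result is a plain dictionary, it is easy to post-process the list. The sketch below keeps only reasonably large, high-confidence masks and converts their boxes from xywh to xyxy. The helper names, the thresholds, and the sample values are all hypothetical, and the sample entries omit the `segmentation` array for brevity.

```python
def filter_masks(result, min_area=100, min_iou=0.9):
    """Keep masks above the area and predicted_iou thresholds, largest first."""
    kept = [
        m for m in result
        if m["area"] >= min_area and m["predicted_iou"] >= min_iou
    ]
    return sorted(kept, key=lambda m: m["area"], reverse=True)


def xywh_to_xyxy(bbox):
    """Convert [x, y, width, height] to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


# Made-up entries shaped like the generator's output.
result = [
    {"area": 5200, "bbox": [10, 20, 80, 90], "predicted_iou": 0.97},
    {"area": 40, "bbox": [300, 15, 8, 6], "predicted_iou": 0.95},
    {"area": 12000, "bbox": [0, 0, 200, 150], "predicted_iou": 0.71},
    {"area": 900, "bbox": [120, 60, 30, 40], "predicted_iou": 0.92},
]

kept = filter_masks(result, min_area=100, min_iou=0.9)
boxes_xyxy = [xywh_to_xyxy(m["bbox"]) for m in kept]
print(boxes_xyxy)
```

Suitable thresholds depend on your images, so treat the defaults here as placeholders to tune.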

To run the code below you will need images. You can use your own, programmatically pull them in from Roboflow, or download one of the over 200k datasets available on Roboflow Universe.

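import cv2
from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(sam)

# IMAGE_PATH is the path to your input image.
# OpenCV loads images as BGR, while SAM expects RGB.
image_bgr = cv2.imread(IMAGE_PATH)
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
# Returns a list of dicts, one per generated mask.
result = mask_generator.generate(image_rgb)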

The supervision package (starting from version 0.5.0) provides native support for SAM, making it easier to annotate segmentations on an image.

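import supervision as sv

# Color each mask by its index in the detections list.
# Note: recent supervision releases use color_lookup=sv.ColorLookup.INDEX instead of color_map.
mask_annotator = sv.MaskAnnotator(color_map="index")
detections = sv.Detections.from_sam(result)
# Annotate a copy so the original image stays unchanged for comparison.
annotated_image = mask_annotator.annotate(image_bgr.copy(), detections)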
Figure showing the original (left) and segmented (right) image.
Figure showing all obtained segmentations separately.

Generate Segmentation Mask with Bounding Box

Now that you know how to generate a mask for all objects in an image, let’s see how you can use a bounding box to focus SAM on a specific portion of your image.

To extract masks for a specific area of an image, import SamPredictor and pass your bounding box to the predictor’s predict method. Note that the mask predictor has a different output format than the automated mask generator. The bounding box must be a NumPy array of the form [x_min, y_min, x_max, y_max].

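import cv2
import numpy as np
from segment_anything import SamPredictor

mask_predictor = SamPredictor(sam)

image_bgr = cv2.imread(IMAGE_PATH)
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
# Compute the image embedding once; later prompts reuse it.
mask_predictor.set_image(image_rgb)

# Box prompt in [x_min, y_min, x_max, y_max] pixel coordinates.
box = np.array([70, 247, 626, 926])
# With multimask_output=True, SAM returns several candidate masks and a score for each.
masks, scores, logits = mask_predictor.predict(
    box=box,
    multimask_output=True
)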
Figure showing the original image with bounding box (left) and segmented (right) image.
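
With `multimask_output=True`, the predictor returns several candidate masks together with a quality score for each, so a common next step is to keep the highest-scoring one. The sketch below shows that selection on synthetic arrays shaped like the predictor's output; `select_best_mask` and `mask_to_box` are hypothetical helper names, not part of segment_anything.

```python
import numpy as np


def select_best_mask(masks, scores):
    """Return the candidate mask with the highest score, plus that score."""
    best_index = int(np.argmax(scores))
    return masks[best_index], float(scores[best_index])


def mask_to_box(mask):
    """Tight [x_min, y_min, x_max, y_max] box around a boolean (H, W) mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])


# Synthetic stand-ins for the predictor output: 3 candidates on a 6x8 image.
masks = np.zeros((3, 6, 8), dtype=bool)
masks[0, 1:3, 1:3] = True
masks[1, 1:5, 2:7] = True
masks[2, :, :] = True
scores = np.array([0.81, 0.96, 0.64])

best_mask, best_score = select_best_mask(masks, scores)
print(best_mask.shape, best_score, mask_to_box(best_mask))
```

Picking by score is only one reasonable policy; for some datasets you may prefer the largest or smallest candidate instead.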

Convert Object Detection Datasets into Segmentation Masks

To convert bounding boxes in your object detection dataset into segmentation masks, download the dataset in COCO format and load the annotations into memory.

If you don't have a dataset in this format, Roboflow Universe is the ideal place to find and download one. Now you can use the SAM model to generate segmentation masks for each bounding box. Head over to the Google Colab notebook, where you will find the code to convert from bounding box to segmentation.
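
As a rough illustration of the loading step (not the notebook's actual code), the sketch below reads COCO-style annotations and groups the boxes by image, converting each from COCO's [x, y, width, height] layout to the [x_min, y_min, x_max, y_max] layout that SamPredictor expects. The `boxes_by_image` helper and the tiny inline dataset are hypothetical.

```python
import json
from collections import defaultdict


def boxes_by_image(coco):
    """Map each image file name to its boxes in [x_min, y_min, x_max, y_max]."""
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO stores boxes as xywh
        grouped[file_names[ann["image_id"]]].append([x, y, x + w, y + h])
    return dict(grouped)


# Tiny made-up dataset; in practice, json.load() your annotations file.
coco_json = """
{
  "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 0, "bbox": [10, 20, 50, 80]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [0, 0, 5, 5]},
    {"id": 12, "image_id": 2, "category_id": 0, "bbox": [70.5, 30, 30, 20]}
  ]
}
"""
coco = json.loads(coco_json)
boxes = boxes_by_image(coco)
print(boxes)
```

Each box can then be wrapped in `np.array(...)` and passed as the `box` argument of the predictor's predict method after calling set_image on the matching image.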

Conclusion

The Segment Anything Model offers a powerful and versatile solution for object segmentation in images, enabling you to enhance your datasets with segmentation masks.

With its fast processing speed and various modes of inference, SAM is a valuable tool for computer vision applications. To experience labeling your data with SAM, you can use Roboflow Annotate, which offers Smart Polygon, an automated polygon annotation tool powered by SAM.