Image classification is a computer vision task that aims to assign one or multiple labels to an image. For many years, image classification, even for common objects such as fruit, involved training a custom vision model, such as a ResNet model, for the specific task. Then, zero-shot classification models arrived, which enable you to classify images without training a model.

In this guide, we are going to discuss what zero-shot classification is, the applications of zero-shot classification, popular models, and how to use a zero-shot classification model. Without further ado, let’s get started!

What is Zero-Shot Classification?

Zero-shot classification models are large, pre-trained models that can classify images without being trained on a particular use case.

One of the most popular zero-shot models is Contrastive Language-Image Pretraining (CLIP), developed by OpenAI. Given a list of prompts (e.g., “cat”, “dog”), CLIP returns a similarity score for each prompt that indicates how similar the embedding calculated from that text prompt is to the image embedding. You can then take the prompt with the highest score as the label for the image.
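If you want to experiment with this idea locally, here is a minimal sketch using the Hugging Face Transformers implementation of CLIP (an assumption for illustration; the hands-on walkthrough later in this guide uses the hosted Roboflow Inference endpoint instead):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP checkpoint and its matching processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("image.jpg")  # hypothetical local image path
prompts = ["cat", "dog"]

# Embed the image and each text prompt, then compare them.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability-like distribution over the prompts.
probs = outputs.logits_per_image.softmax(dim=1)[0]
print({prompt: float(p) for prompt, p in zip(prompts, probs)})

The prompt with the highest score is the predicted label for the image.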

CLIP was trained on over 400 million image-text pairs. Through this training process, CLIP developed an understanding of how text relates to images. Thus, you can ask CLIP to classify images by common objects (e.g., “cat”) or by a characteristic of the scene (e.g., “park” or “parking lot”). Between these two capabilities lie many possibilities.

Consider the following image:

This image features a Toyota Camry. When passed through CLIP with the prompts “Toyota” and “Mercedes”, CLIP correctly identified the car as a Toyota. Here were the results from the model:

  • Toyota: 0.9989
  • Mercedes: 0.00101

The higher the number, the more similar the embedding associated with a text prompt is to the image embedding.

Notably, we did not train or fine-tune this model for car brand classification; out of the box, a zero-shot model was able to solve our problem.

Consider this image:

This image features a billboard. Let’s run CLIP with five classes: “billboard”, “traffic sign”, “sign”, “poster”, and “something else”. Here are the results:

  • billboard: 0.96345
  • traffic sign: 0.01763
  • sign: 0.01548
  • poster: 0.00207
  • something else: 0.00137

The text prompt embedding with the highest similarity to the image embedding was “billboard”. Although a billboard is technically a poster, and we provided “poster” as a class, the CLIP embeddings captured that the image contains a billboard rather than a generic poster. Thus, the similarity score for “billboard” was higher than the score for “poster”.

“Something else” is a common catch-all background class to include in your prompts, since you need to provide at least two prompts for classification.

Zero-Shot Classification Applications

For classifying common scenes, such as whether an image contains a person, whether a person is wearing a mask, or whether an image contains a billboard, zero-shot models can be used out of the box, without any fine-tuning. This lets you integrate vision into an application significantly faster, because you eliminate the time and cost required to train a model.

You can use CLIP on video frames, too. For example, you could use CLIP to identify when a person appears on a security camera feed at night and flag to a security officer that someone has entered the scene. Or you could use CLIP to identify when a box is or is not present on a conveyor belt.
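As a rough sketch of that workflow, you could sample frames from a video with OpenCV and pass each one to a classification helper. The scan_video and classify_frame names below are hypothetical; classify_frame stands in for whichever CLIP inference approach you use, called with prompts such as ["person", "something else"]:

import cv2

def scan_video(path: str, classify_frame) -> None:
    # Read the video and determine its frame rate (fall back to 30 if unknown).
    video = cv2.VideoCapture(path)
    fps = video.get(cv2.CAP_PROP_FPS) or 30
    frame_index = 0

    while True:
        ok, frame = video.read()
        if not ok:
            break
        # Classify roughly one frame per second to keep inference costs down.
        if frame_index % int(fps) == 0:
            label = classify_frame(frame)
            if label == "person":
                print(f"Person detected at ~{frame_index / fps:.0f}s")
        frame_index += 1

    video.release()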


You can also use a zero-shot model like CLIP to label data for use in training a smaller, fine-tuned model. This is ideal if a zero-shot model performs well only some of the time and you need higher accuracy or lower latency. You can use Autodistill and the Autodistill CLIP module to automatically label data with CLIP for use in training a fine-tuned classification model.
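A sketch of that auto-labeling workflow, assuming the autodistill and autodistill-clip packages (check the Autodistill documentation for the exact, current API), might look like this:

from autodistill.detection import CaptionOntology
from autodistill_clip import CLIP

# Map the text prompt CLIP sees to the class name you want in your dataset.
base_model = CLIP(
    ontology=CaptionOntology(
        {
            "a billboard": "billboard",
            "something else": "background",
        }
    )
)

# Label every image in ./images and write an annotated dataset to ./dataset
# (folder names here are placeholders for your own data).
base_model.label(input_folder="./images", output_folder="./dataset")

The resulting labeled dataset can then be used to train a smaller, faster classification model.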

Learn more about how to train a classification model with no labeling.

Popular Zero-Shot Classification Models

We have mentioned CLIP frequently in this post because it is used for many zero-shot classification tasks. That said, other models are available, and many of them use and improve on the CLIP architecture OpenAI introduced in 2021.

For example, Meta AI Research released MetaCLIP in September 2023, a version of CLIP with an openly documented training data distribution, unlike the undisclosed dataset used to train the original CLIP. AltCLIP was trained on multiple languages, enabling you to provide multilingual prompts.


How to Use Zero-Shot Classification Models

Let’s walk through an example that shows how to use CLIP to classify an image. For this guide, we are going to use a hosted version of Roboflow Inference, a tool that enables you to run large foundation vision models as well as fine-tuned models.

We will build an application that runs CLIP on an image using the hosted Roboflow CLIP endpoint, which performs inference in the cloud.

Create a new Python file and add the following code:

import requests
import base64
from PIL import Image
from io import BytesIO

INFERENCE_ENDPOINT = "https://infer.roboflow.com"
API_KEY = "API_KEY"

prompts = [
    "orange",
    "apple",
    "banana"
]

def classify_image(image: str) -> dict:
    # Load the image and convert it to RGB so it can be saved as a JPEG
    # (PNG images with an alpha channel would otherwise raise an error).
    image_data = Image.open(image).convert("RGB")

    # Encode the image as a base64 string to send in the request payload.
    buffer = BytesIO()
    image_data.save(buffer, format="JPEG")
    encoded_image = base64.b64encode(buffer.getvalue()).decode("utf-8")

    payload = {
        "api_key": API_KEY,
        "subject": {
            "type": "base64",
            "value": encoded_image
        },
        "prompt": prompts,
    }

    # Send the image and prompts to the hosted CLIP comparison endpoint.
    data = requests.post(INFERENCE_ENDPOINT + "/clip/compare?api_key=" + API_KEY, json=payload)

    return data.json()

def get_highest_prediction(predictions: list) -> str:
    # Track the highest similarity score and the index of its prompt.
    highest_prediction = 0
    highest_prediction_index = 0

    for i, prediction in enumerate(predictions):
        if prediction > highest_prediction:
            highest_prediction = prediction
            highest_prediction_index = i

    return prompts[highest_prediction_index]

In the code above, replace:

  1. API_KEY with your Roboflow API key. Learn how to retrieve your Roboflow API key.
  2. prompts with the prompts you want to use in prediction.

Then, add the following code:

image = "image.png"
predictions = classify_image(image)
print(get_highest_prediction(predictions["similarity"]), image)

Let’s run inference on the following image of a shirt with the prompts “shirt” and “sweatshirt”:

The class with the highest similarity is “sweatshirt”. We successfully classified the image with CLIP.

You can also run CLIP on frames from videos. Learn more about how to analyze videos with CLIP.

Conclusion

Zero-shot classification models play a key role in computer vision. You can use them out of the box in your application, to label images for training, or to analyze video frames. Many applications use CLIP as a starting point, since it performs well across a wide range of tasks.

Now you have all the knowledge you need to start using zero-shot computer vision models!