On November 6, 2023, OpenAI released a vision-enabled version of the GPT-4 API. This API, referred to by the gpt-4-vision-preview identifier, enables you to ask a question and provide an image as context. We previously reported on GPT-4V's capabilities, noting impressive performance in image understanding. These capabilities are well suited to classification.

In this post, we are going to walk through Autodistill GPT-4V, an open source project that lets you use GPT-4V to automatically label data. You can then use the labeled data to train a smaller, fine-tuned model for your use case and deploy the model using the open source Roboflow Inference Server.

By the end of this post, we will train a model to classify fish, like this image:

Why Should I Distill GPT-4V for Classification?

By distilling GPT-4V for image classification, you can automatically label data, then create a fine-tuned model that you can run on-device, without an internet connection, and without paying for each API request.

GPT-4V has an extensive set of knowledge about the world. In our initial tests, we asked it about a range of tasks: OCR on a blog post mentioning Taylor Swift’s music, computer vision memes, math OCR, and more.

You can ask broad questions about images. For example, given an image, you could ask “is this image a shipping container or something else?”

You can also ask GPT-4V more specific questions. For example, you could ask “is the fish in the image below a salmon, pike, or trout?”
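To see what this looks like as a raw API call, here is a minimal sketch using the openai Python package and the gpt-4-vision-preview model; the image path and prompt are placeholders for your own:

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# encode a local image as a base64 data URL
with open("fish.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Is the fish in this image a salmon, pike, or trout? Answer with one word.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=10,
)

print(response.choices[0].message.content)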

If you ask similar questions over a whole dataset of images, you can assign labels to each image. These labels can then be formatted as a classification dataset, which can be used for training a smaller model, like Ultralytics YOLOv8 Classification.

How to Label Images with GPT-4V for Classification

Autodistill is an ecosystem of classification, detection, and segmentation foundation models that have a broad range of knowledge. You can use foundation models, from CLIP to Grounding DINO to Segment Anything, to label images in a few lines of code, then train a “target model” such as YOLOv8 on the labeled data.

Autodistill GPT-4V enables you to provide an image, retrieve a classification result, then save the results in a classification folder dataset.

To get started, first install Autodistill and Autodistill GPT-4V:

pip install autodistill autodistill-gpt4v

If you do not have an OpenAI account, create one and add your credit card information. This is required because the GPT-4V API costs money to use. See the OpenAI Pricing page for more information about how much usage of the API will cost.

Create an API key in the OpenAI dashboard. Run the following command to set your API key:

export OPENAI_API_KEY=your-key
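If you are working in a notebook or prefer to configure the key in code, you can also set the environment variable from Python before creating the model. Hard-coding a key like this is only advisable for quick experiments:

import os

# set the key for this process only; replace with your actual key
os.environ["OPENAI_API_KEY"] = "your-key"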

Next, create a new Python file called `distill.py` and add the following code:

from autodistill_gpt_4v import GPT4V
from autodistill.detection import CaptionOntology

base_model = GPT4V(
    ontology=CaptionOntology(
        {
            "salmon": "salmon",
            "carp": "carp"
        }
    )
)

result = base_model.predict("fish.jpg", base_model.ontology.prompts())

# get_top_k(1) returns parallel arrays of class IDs and confidences
class_ids, confidences = result.get_top_k(1)

class_result = base_model.ontology.prompts()[class_ids[0]]
print(class_result)

A caption ontology is used to define what we want to classify. The ontology accepts a dictionary with an arbitrary number of classes. The key in the dictionary is the prompt that will be sent to the foundation model (in this case, GPT-4V). The value is the label that will be saved in a dataset if you choose to label a whole folder of images (more on that below).
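The prompt and the label do not need to match. For example, here is a sketch of an ontology that sends more descriptive prompts to GPT-4V while saving short labels to disk; the prompt wording below is illustrative:

from autodistill.detection import CaptionOntology

ontology = CaptionOntology(
    {
        # prompt sent to GPT-4V -> label saved in the dataset
        "a salmon, a fish with pink flesh and silver skin": "salmon",
        "a carp, a large freshwater fish with big scales": "carp",
    }
)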

This code will make one API request per image.
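Because cost scales linearly with the number of images, it is worth estimating your spend before labeling a large folder. A back-of-the-envelope sketch, where the per-image price is a placeholder you should replace with the current figure from the OpenAI pricing page:

num_images = 500
price_per_image = 0.01  # placeholder; check the OpenAI pricing page

print(f"Estimated labeling cost: ${num_images * price_per_image:.2f}")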

Let’s run the code on the following image:

GPT-4V successfully classified the image as “salmon”.

Note: GPT-4V does not return classification confidences. If a class is identified, the confidence will be 1; otherwise, the confidence will be 0.

Once you have experimented with prompts and evaluated how GPT-4V performs on your use case, you can label a folder of images. You can do so using the following code:

base_model.label("./context_images/", extension=".jpeg")

This will run GPT-4V on every image in the folder and save the results in a classification folder dataset.

You can then train a model with the dataset. For example, you can train a YOLOv8 classification model. To do so, first install the YOLOv8 Autodistill module.

pip install autodistill-yolov8

Next, create a new Python script and add the following lines of code:

from autodistill_yolov8 import YOLOv8

# use a classification checkpoint ("-cls"), since we are training a classifier
target_model = YOLOv8("yolov8n-cls.pt")

# train a model on the labeled classification folder dataset
target_model.train("./context_images_labeled/", epochs=200)

# export weights for future use
saved_weights = target_model.export(format="onnx")

# show performance metrics for your model
metrics = target_model.val()

# run inference on the new model
pred = target_model.predict("./context_images_labeled/valid/images/fish-7.jpg", conf=0.01)

Once you run this code, a YOLOv8 model will be trained. Then, inference will be run on an image in the validation set. You can now deploy the YOLOv8 model locally using Roboflow Inference.
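You can also load the trained weights directly with the ultralytics package to spot-check the classifier. The weights path below assumes the default Ultralytics output location, which may differ on your machine:

from ultralytics import YOLO

# path is an assumption: Ultralytics saves classification runs under runs/classify/ by default
model = YOLO("runs/classify/train/weights/best.pt")

# run the classifier on a new image and print per-class probabilities
results = model("fish.jpg")
print(results[0].probs)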

Conclusion (and Inspiration)

Ideas abound for using GPT-4V via the API for vision-related tasks, including the following (see the ontology sketch after this list):

  1. Vehicle classification (is this a Camry or a Jeep?)
  2. Spatial classification (does this image contain a food tray that is empty?)
  3. Damage detection (does this vehicle contain a scratch, a dent, or no damage?)
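Each of these ideas maps to a caption ontology in the same way as the fish example. For instance, here is a sketch of a base model for the damage detection idea; the prompt wording and class names are illustrative, not prescriptive:

from autodistill_gpt_4v import GPT4V
from autodistill.detection import CaptionOntology

damage_model = GPT4V(
    ontology=CaptionOntology(
        {
            # prompt sent to GPT-4V -> label saved in the dataset
            "a vehicle with a scratch": "scratch",
            "a vehicle with a dent": "dent",
            "a vehicle with no visible damage": "no damage",
        }
    )
)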

We encourage you to explore GPT-4V and test its capabilities. Share your results so that the community can better evaluate how GPT-4V can help solve a business problem.