Released in January 2021, OpenAI’s Contrastive Language-Image Pretraining (CLIP) model revolutionized image classification. Using CLIP, you can compare the similarity of an arbitrary text prompt and an image, or the similarity of two images. For example, given a video featuring different scenes, you can use CLIP to check whether a scene is set in a room, a park, a warehouse, et cetera.

CLIP is a zero-shot classification model, which means no task-specific training is required to classify images against a new set of labels. CLIP performs well with general concepts and objects, but struggles with more specific objects. We wanted to test CLIP and GPT-4V side by side on specialized classification tasks to see how each model performs.

In this guide, we are going to run three tests. We will evaluate how CLIP and GPT-4V perform for:

  1. Car brand classification
  2. Cup classification
  3. Pizza type classification

Without further ado, let’s get started!

Our Methodology: How We Run Tests

In this guide, we will source images from Google and run them through CLIP and GPT-4V.

CLIP returns a probability for each class, whereas GPT-4V does not; it is not a traditional classification model and returns free-form text. To evaluate results, we will take the prediction with the highest probability from CLIP and ask GPT-4V to return a single classification result.
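For reference, here is a minimal sketch of what taking the top CLIP prediction looks like. This sketch uses the Hugging Face transformers implementation of CLIP rather than the Autodistill wrapper we describe below, and the image path and class list are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the public OpenAI CLIP checkpoint (ViT-B/32)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["Tesla Model 3", "Toyota Camry"]  # placeholder class list
image = Image.open("car.jpg")  # placeholder image path

inputs = processor(text=classes, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# A softmax over the image-text similarity scores gives one probability per
# class; we report the class with the highest probability.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
top = int(probs.argmax())
print(classes[top], round(float(probs[top]), 3))
```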

Our GPT-4V prompt is:

What is in the image? Return the class of the object in the image. Here are the classes: CLASSES. You can only return one class from that list.

Where “CLASSES” is a comma-separated list of classes.
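In code, substituting the class list into this prompt can look like the following sketch (the class names are placeholders):

```python
def build_prompt(classes: list[str]) -> str:
    # Substitute a comma-separated class list into the GPT-4V prompt template
    return (
        "What is in the image? Return the class of the object in the image. "
        f"Here are the classes: {', '.join(classes)}. "
        "You can only return one class from that list."
    )

print(build_prompt(["Tesla Model 3", "Toyota Camry"]))
```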

To run inference with CLIP and GPT-4V, we will use Autodistill. We have developed Autodistill modules for CLIP and GPT-4V that enable you to use both models in the same script with a few lines of code.
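Below is a minimal sketch of what this looks like. The package and class names (autodistill-clip with CLIP, autodistill-gpt-4v with GPT4V), the OPENAI_API_KEY requirement, and the image path are assumptions for illustration; the scripts in the repository linked below are the reference implementation.

```python
# Sketch: classify one image with both models via Autodistill.
# Assumes autodistill, autodistill-clip, and autodistill-gpt-4v are installed,
# and that OPENAI_API_KEY is set in your environment for GPT-4V.
from autodistill.detection import CaptionOntology
from autodistill_clip import CLIP
from autodistill_gpt_4v import GPT4V

# Map text prompts to class labels
ontology = CaptionOntology({
    "Tesla Model 3": "Tesla Model 3",
    "Toyota Camry": "Toyota Camry",
})

clip_model = CLIP(ontology=ontology)
gpt_4v_model = GPT4V(ontology=ontology)

# "car.jpg" is a placeholder image path
print(clip_model.predict("car.jpg"))
print(gpt_4v_model.predict("car.jpg"))
```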

We have created a folder of scripts in the Roboflow GPT-4V experiments GitHub repository that you can use to both evaluate how we run our tests and to run your own tests.

Test #1: Car Brand Classification

Suppose you are building an application to sell used cars. Being able to take photos of a car and infer information about it (its color, brand, and number of seats) would make for a more convenient data entry experience than filling out a long form.

Let’s focus on identifying the car brand. You could potentially use CLIP or GPT-4V to classify the brand of the car as part of a larger system. Let’s test CLIP and GPT-4V to see how they perform.

Consider this photo of a Toyota Camry:

We are going to prompt CLIP and GPT-4V with two options:

  • Tesla Model 3
  • Toyota Camry

If GPT-4V is not sure, we will ask the model to return None.

The results are as follows:

  • CLIP result: Toyota Camry
  • GPT-4V result: Toyota Camry

Both models performed well at identifying the car.

Let’s try another car, a Toyota Prius:

Our prompt options are:

  • Tesla Model 3
  • Toyota Camry
  • Toyota Prius

Here are the results:

  • CLIP result: Toyota Prius
  • GPT-4V result: Toyota Prius

Both models successfully identified the car model.

Test #2: Cup Classification

When working with zero-shot object detection models, we have observed that models sometimes struggle to distinguish between similar objects that often appear in the same context. 

When experimenting with Grounding DINO, a zero-shot object detection model, we found that the model struggled to differentiate between types of cups, types of trash, and other similar objects.

We have not experienced this issue specifically with CLIP, but we wanted to probe how GPT-4V performs at identifying attributes of similar objects and document our results.

Here is the image we used as a test:

We created two prompts:

  1. Plastic cup
  2. Ceramic cup

We fed these prompts into CLIP and GPT-4V. Here were the results:

  • CLIP result: ceramic cup
  • GPT-4V result: ceramic cup

Both models successfully identified the material of the cup. One potential application of this would be in recycling sorting: you could use CLIP or GPT-4V to identify the constituent material of an object (ceramic vs. plastic vs. glass) to determine how the object should be sorted.

Test #3: Pizza Type Classification

To the human eye, a well-made Chicago deep dish pizza is easy to identify from cues like the pan in which the pizza is served; the depth of the pizza is a big giveaway. We wondered: how would CLIP and GPT-4V perform at classifying deep dish versus regular pizza?

We sourced the following image:

One application of this system is in restaurants. You could have a camera that looks at pizza in a restaurant and checks that the right pizza is sent to the right place prior to being served to a customer.

So, how do the models perform?

Here are the results:

  • CLIP result: chicago deep dish pizza
  • GPT-4V result: chicago deep dish pizza

Both models were able to successfully classify the type of pizza.

CLIP vs. GPT-4V: Overall Impressions

The above tests have one feature in common: rather than distinguishing between two very different objects (a cat vs. a dog), we wanted to analyze how CLIP and GPT-4V compare when classifying similar objects.

We were surprised by how well CLIP performed: in all three tests, CLIP successfully identified the objects. In the case of GPT-4V, we were not sure what to expect, and we were excited by how the model performed.

While CLIP and GPT-4V achieved equivalent performance in our tests, the models have different deployment considerations. CLIP runs locally, and you can run it in close to real time on a device like a Mac or a machine with a CUDA-enabled GPU. GPT-4V, in contrast, requires sending a request to an external API hosted by OpenAI, which adds overhead to every prediction.

We encourage you to experiment with CLIP and GPT-4V using our scripts to explore their performance on your own tasks. Tag Roboflow on LinkedIn or Twitter if you post about your results; we are curious to see what experiments people do!