CLIP is a gigantic leap forward, bringing many of the recent developments from the realm of natural language processing into the mainstream of computer vision: unsupervised learning, transformers, and multimodality to name a few. The burst of innovation it has inspired shows its versatility.

And this is likely just the beginning. There has been scuttlebutt recently about the coming age of "foundation models" in artificial intelligence that will underpin the state of the art across many different problems in AI; I think CLIP is going to turn out to be the bedrock model for computer vision.

In this post, we aim to catalog the continually expanding use-cases for CLIP; we will update it periodically.

Prefer video content? Subscribe to our YouTube channel.

What is CLIP?

In a nutshell, CLIP is a multimodal model that combines knowledge of English-language concepts with semantic knowledge of images.

It can just as easily distinguish between an image of a "cat" and a "dog" as it can between "an illustration of Deadpool pretending to be a bunny rabbit" and "an underwater scene in the style of Vincent Van Gogh" (even though it has definitely never seen those things in its training data). This is because of its generalized knowledge of what those English phrases mean and what those pixels represent.

This is in contrast to traditional computer vision models which disregard the context of their labels (in other words, a "normal" image classifier works just as well if your labels are "cat" and "dog" or "foo" and "bar"; behind the scenes it just converts them into a numeric identifier with no particular meaning).
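As a concrete illustration, here's a minimal zero-shot classification sketch using the open source openai/CLIP package (installable from GitHub); the image path and label phrases are placeholders you'd swap for your own:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# the "labels" are free-form English phrases, not opaque class IDs
labels = ["a photo of a cat", "a photo of a dog"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # score the image against every phrase in a single pass
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.2%}")

Swapping "cat" and "dog" for entirely different phrases requires no retraining; only the text changes.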

Imagine you're given a filing cabinet and 100,000 documents. Your job is to put them each into the correct folder out of the 1,000 folders in the cabinet. At the end of the day your boss will judge your work.

Unfortunately, you're illiterate. You start off doing no better than random chance. But, one day you realize that some of the documents are crisp and white and some are tattered and yellowed. You decide to sort the documents by color and split them evenly between the folders. Your boss is pleased and gives you slightly better marks that day. Day by day you try to discover new things that are different about the files: some are long and some are short. Some have photos and some do not. Some are paper-clipped and some are stapled.

Then, one day, after years and years of tirelessly deciphering this enigma, trying different combinations of folders and ways of dividing the documents, improving your performance bit by bit, your boss introduces you to your new coworker. You furrow your brow trying to figure out how you're going to train her to execute on your delicate and complicated system.

But, to your surprise, on her very first day, her performance exceeds yours! It turns out your new coworker is CLIP and she knows how to read. Instead of having to guess what the folders should contain she simply looks at their labels. And instead of discovering clues about the documents bit by bit, she already has prior knowledge of what those indecipherable glyphs represent.

In real-world tasks, the "glyphs" are actually patterns of pixels (features) representing abstractions like colors, shapes, and textures (and even concepts like people and locations).

If you're interested in learning more about what CLIP is and how it works, check out our CLIP 101 post.

Use Cases

One of the neatest aspects of CLIP is how versatile it is. When OpenAI introduced it, they highlighted two use-cases: image classification and image generation. But in the nine months since its release, it has been put to a far wider variety of tasks.

Image classification

OpenAI originally evaluated CLIP as a zero-shot image classifier. They compared it against traditional supervised machine learning models and it performed nearly on par with them without having to be trained on any specific dataset.

CLIP for Image Classification

One challenge with traditional approaches to image classification is that you need lots of training examples that closely resemble the distribution of the images the model will see in the wild. Because CLIP doesn't need task-specific training data, its advantage over traditional models is largest when little training data is available.

To try CLIP for image classification, follow our CLIP tutorial. If you're having trouble getting good results, read our tips on prompt engineering.
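One prompt engineering trick worth knowing: wrapping bare class names in templates like "a photo of a {}" (and averaging the embeddings of several templates) often improves zero-shot accuracy. A hedged sketch of that idea, again using the openai/CLIP package with illustrative class names and templates:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = ["cat", "dog"]
templates = ["a photo of a {}", "a blurry photo of a {}", "an illustration of a {}"]

with torch.no_grad():
    class_embeddings = []
    for name in classes:
        # embed every templated prompt for this class and average them
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        embeddings = model.encode_text(tokens)
        embeddings /= embeddings.norm(dim=-1, keepdim=True)
        class_embeddings.append(embeddings.mean(dim=0))

    # these averaged vectors act as the zero-shot "classifier weights";
    # compare a normalized image embedding against them with a dot product
    zero_shot_weights = torch.stack(class_embeddings)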

Image Generation

DALL-E was developed by OpenAI in tandem with CLIP. It's a generative model that can produce images from a textual description; CLIP was used to rank and filter its outputs.

An image generated by CLIP+VQGAN.

The DALL-E model has still not been released publicly, but CLIP has been behind a burgeoning AI-generated art scene. It is used to "steer" a GAN (generative adversarial network) towards a desired output. The most commonly used combination is CLIP+VQGAN, built on Taming Transformers' VQGAN, which we dove into in depth here.

Content Moderation

One extension of image classification is content moderation. If you prompt it the right way, CLIP can filter graphic or NSFW images out of the box, as sketched below. We demonstrated content moderation with CLIP in a post here.

An image drawn by a user of paint.wtf and flagged as NSFW by CLIP
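As a rough sketch of how that can work, you can score an image against a "safe" phrase and an "NSFW" phrase and flag it when the questionable one wins. The prompts, file name, and threshold below are illustrative, not tuned:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# illustrative prompts; a real moderation pipeline would tune these carefully
prompts = ["a safe, family-friendly image", "an explicit or graphic NSFW image"]

image = preprocess(Image.open("submission.png")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)[0]

# crude threshold; adjust based on your tolerance for false positives
if probs[1] > 0.5:
    print("flagged for review")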

Image Search

Because CLIP doesn't need to be trained on specific phrases, it's perfectly suited for searching large catalogs of images. It doesn't need images to be tagged and can do natural language search.

Yurij Mikhalevich has already created an AI-powered command line image search tool called rclip. It wouldn't surprise me if CLIP spawns a Google Image Search competitor in the near future.

Using CLIP for image search with rclip.
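A minimal version of this kind of search is: embed every image in the catalog once, then embed the query text and rank by cosine similarity. The sketch below assumes a folder of JPEGs at photos/ and uses the openai/CLIP package:

from pathlib import Path

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# index: embed every image in the catalog once (no tags required)
paths = sorted(Path("photos").glob("*.jpg"))
with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# query: embed the search phrase and rank the catalog by cosine similarity
query = "a dog playing in the snow"
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize([query]).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

scores = (image_features @ text_features.T).squeeze(1)
for score, path in sorted(zip(scores.tolist(), paths), reverse=True)[:5]:
    print(f"{score:.3f}  {path}")

For a large catalog you'd batch the indexing step and store the vectors somewhere queryable, but the ranking logic stays the same.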

Image Similarity

Apple's NeuralHash image similarity algorithm has been in the news a lot recently because of how it's being applied to scan user devices for CSAM. We showed how you can use CLIP to find similar images in much the same way NeuralHash does.

Similar image search; these images match even though the watermark differs.

The applications of being able to find similar images go far beyond scanning for illegal content, though. It could be used to search for copyright violations, build a clone of TinEye, or create an advanced photo library de-duplicator.
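The underlying mechanic is just image-to-image cosine similarity in CLIP's embedding space. A minimal sketch (the file names are placeholders and the 0.9 cutoff is a guess you'd tune on your own data):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(path: str) -> torch.Tensor:
    # encode an image and normalize it so a dot product equals cosine similarity
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image)
    return features / features.norm(dim=-1, keepdim=True)

a, b = embed("original.jpg"), embed("candidate.jpg")
similarity = (a @ b.T).item()

# near-duplicates (crops, watermarks, re-encodes) tend to score very high
print("likely the same image" if similarity > 0.9 else "probably different images")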

Image Ranking

It's not just factual representations that are encoded in CLIP's memory. It knows about qualitative concepts as well (as we learned from the Unreal Engine trick).

Ranking images with CLIP.

We used this to create a CLIP-judged Pictionary-style game. You could also use it to build a camera app that "scores" users' photos by "searching" for phrases like "award winning photograph" or "professional selfie of a model," helping users decide which images to keep and which to trash.
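A hedged sketch of that scoring idea: compare each photo against a "good" phrase and a "bad" phrase and keep the relative score. The prompts and file names are illustrative:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# score each photo by how much closer it is to the "good" prompt than the "bad" one
prompts = ["an award winning photograph", "a blurry, poorly lit amateur snapshot"]
text = clip.tokenize(prompts).to(device)

for path in ["photo_1.jpg", "photo_2.jpg", "photo_3.jpg"]:
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        score = logits_per_image.softmax(dim=-1)[0, 0].item()
    print(f"{path}: {score:.1%} 'award winning'")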

Object Tracking

As an extension of image similarity, we've used CLIP to track objects across the frames of a video. An object detection model finds items of interest, each detection is cropped out of its frame, and CLIP determines whether two detected objects from different frames are the same instance or different instances of that object.
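The core of that idea fits in a few lines: crop each detection out of its frame, embed the crops, and treat a high cosine similarity as "same instance". This sketch assumes the bounding boxes already come from your detector; the file names, coordinates, and threshold are illustrative:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_crop(frame: Image.Image, box: tuple) -> torch.Tensor:
    # box is (left, top, right, bottom) from your object detection model
    crop = preprocess(frame.crop(box)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(crop)
    return features / features.norm(dim=-1, keepdim=True)

# embeddings of one detection in frame N and one detection in frame N+1
previous = embed_crop(Image.open("frame_0100.jpg"), (40, 60, 220, 300))
current = embed_crop(Image.open("frame_0101.jpg"), (55, 62, 235, 305))

# high similarity suggests the two boxes contain the same object instance
same_instance = (previous @ current.T).item() > 0.85  # illustrative threshold
print(same_instance)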

Robotics Control

The CLIPort model combines CLIP with another model to allow robots to perform abstract tasks like folding laundry or sorting cubes without having to be given explicit instructions for how to accomplish the objective.

Image Captioning

With the CLIP prefix captioning repo, the feature vectors from CLIP have been wired into GPT-2 to output an English description for a given image.

Example captions from CLIP + GPT2.

Deciphering Corrupted Images

In a new paper called Inverse Problems Leveraging Pre-Trained Contrastive Representations, researchers have shown how CLIP can be used to interpret extremely distorted or corrupted images.

How to Use CLIP

You can use CLIP to classify images through Roboflow Inference, an open source computer vision inference server that runs on your hardware.

Let's walk through how to use CLIP on a webcam feed so that we can test the model in real time.

To get started, first install Inference and the Inference CLIP package:

pip install inference inference[clip]

Next, retrieve your API key from your Roboflow dashboard. If you do not already have a Roboflow account, you can create one for free.

Once you have retrieved your API key, set it in an environment variable called ROBOFLOW_API_KEY:

export ROBOFLOW_API_KEY=""

Next, create a new Python file and add the following code:

import cv2
import inference
from inference.core.utils.postprocess import cosine_similarity

from inference.models import Clip
clip = Clip()

prompt = "a coffee cup"
text_embedding = clip.embed_text(prompt)

def render(result, image):
    # get the cosine similarity between the prompt & the image
    similarity = cosine_similarity(result["embeddings"][0], text_embedding[0])

    # scale the result to 0-100 based on heuristic (~the best & worst values I've observed)
    similarity_range = (0.15, 0.40)
    similarity = (similarity - similarity_range[0]) / (similarity_range[1] - similarity_range[0])
    similarity = max(min(similarity, 1), 0)*100

    # print the similarity
    text = f"{similarity:.1f}%"
    cv2.putText(image, text, (10, 310), cv2.FONT_HERSHEY_SIMPLEX, 12, (255, 255, 255), 30)
    cv2.putText(image, text, (10, 310), cv2.FONT_HERSHEY_SIMPLEX, 12, (206, 6, 103), 16)

    # print the prompt
    cv2.putText(image, prompt, (20, 1050), cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255), 10)
    cv2.putText(image, prompt, (20, 1050), cv2.FONT_HERSHEY_SIMPLEX, 2, (206, 6, 103), 5)

    # display the image
    cv2.imshow("CLIP", image)
    cv2.waitKey(1)

# start the stream
inference.Stream(
    source="webcam",
    model=clip,

    output_channel_order="BGR",
    use_main_thread=True,

    on_prediction=render
)

In this code, we use the inference.Stream() method to run CLIP over every frame from the webcam. We set the prompt to "a coffee cup". The code calculates a CLIP embedding for the prompt, which is then compared to the CLIP embedding calculated for each frame.

We overlay text on each frame that shows how similar CLIP thinks that frame's embedding is to our prompt embedding.

In this example, if we hold up a coffee cup, the similarity will increase:

When the coffee cup comes into view, the percentage similarity increases; when the coffee cup goes out of view, the similarity decreases.

You could amend the code above to work with multiple prompts, then take the prompt whose embedding is most similar to each frame's embedding and use it as a label, as in the sketch below.
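For example, a drop-in replacement for the render function in the script above might look like the following (it reuses the cv2, clip, and cosine_similarity imports from that script; the prompt list is just an example):

# drop-in replacement for render() in the script above; reuses its imports
prompts = ["a coffee cup", "a laptop", "a person waving"]
text_embeddings = [clip.embed_text(p) for p in prompts]

def render(result, image):
    # compare the frame's embedding to every prompt embedding
    frame_embedding = result["embeddings"][0]
    similarities = [
        cosine_similarity(frame_embedding, embedding[0])
        for embedding in text_embeddings
    ]
    best = similarities.index(max(similarities))

    # overlay the closest prompt as the frame's label
    label = f"{prompts[best]} ({similarities[best]:.2f})"
    cv2.putText(image, label, (20, 100), cv2.FONT_HERSHEY_SIMPLEX, 2, (206, 6, 103), 5)
    cv2.imshow("CLIP", image)
    cv2.waitKey(1)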

Our other CLIP guides may be useful as you continue experimenting.

Future Use-Cases

CLIP will be used in many more creative ways in the future. We're working on a CLIP API to make it easier to try building these types of projects. If you'd like early access, please reach out.

Fine-Tuning CLIP

Unfortunately, CLIP isn't capable of performing well out of the box on every problem: hyper-specific use-cases (e.g. examining the output of microchip lithography) or things invented after CLIP was trained in 2020 (for example, the unique characteristics of CLIP+VQGAN creations) can be beyond its reach. It should be possible to extend CLIP with additional data, essentially using it as a fantastic checkpoint for transfer learning.
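In the meantime, one lightweight way to adapt CLIP to a custom dataset is a "linear probe": keep CLIP frozen, extract image features, and train a simple classifier on top (OpenAI demonstrates this pattern in the CLIP repository). A hedged sketch with placeholder file paths and labels:

import clip
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(paths):
    # CLIP stays frozen; we only use it as a feature extractor
    features = []
    with torch.no_grad():
        for path in paths:
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            features.append(model.encode_image(image).cpu().numpy()[0])
    return np.array(features)

# placeholder dataset; substitute your own image paths and integer labels
train_paths, train_labels = ["chip_ok_1.jpg", "chip_defect_1.jpg"], [0, 1]
test_paths, test_labels = ["chip_ok_2.jpg", "chip_defect_2.jpg"], [0, 1]

classifier = LogisticRegression(max_iter=1000)
classifier.fit(extract_features(train_paths), train_labels)
print("accuracy:", classifier.score(extract_features(test_paths), test_labels))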

Object detection

In much the same way we used CLIP to do object tracking, it's conceivable that you could use it for object detection as well. One naive way of doing this would be to feed every candidate anchor box into CLIP and determine which ones are the closest matches to your objects of interest.
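Here's a deliberately naive sketch of that approach: slide a fixed window over the image, embed each crop, and keep the crops whose similarity to the prompt clears a threshold. The window size, stride, and threshold are all illustrative, and a real system would use proper region proposals plus non-max suppression:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompt = "a stop sign"
with torch.no_grad():
    text = model.encode_text(clip.tokenize([prompt]).to(device))
    text /= text.norm(dim=-1, keepdim=True)

image = Image.open("street.jpg")
window, stride = 224, 112  # illustrative sliding-window parameters
candidates = []

for top in range(0, image.height - window + 1, stride):
    for left in range(0, image.width - window + 1, stride):
        box = (left, top, left + window, top + window)
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        with torch.no_grad():
            features = model.encode_image(crop)
            features /= features.norm(dim=-1, keepdim=True)
        score = (features @ text.T).item()
        if score > 0.3:  # illustrative threshold for image-text similarity
            candidates.append((score, box))

# crude "detections"; a real detector would apply non-max suppression here
print(sorted(candidates, reverse=True)[:5])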

Video Indexing

If you can classify images, you can classify the frames of a video. In this way you could automatically split videos into scenes and create search indexes. Imagine searching YouTube for your company's logo and magically finding all of the places where someone happened to have used your product.
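A sketch of the frame-level piece: sample frames with OpenCV, embed them with CLIP, and mark a scene boundary whenever consecutive embeddings drift apart. The sampling rate and the 0.75 threshold are guesses you'd tune, and the video path is a placeholder:

import clip
import cv2
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

video = cv2.VideoCapture("talk.mp4")
previous, frame_index, boundaries = None, 0, []

while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_index % 30 == 0:  # sample roughly one frame per second at 30 fps
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            features = model.encode_image(preprocess(rgb).unsqueeze(0).to(device))
            features /= features.norm(dim=-1, keepdim=True)
        # a sharp drop in similarity between samples suggests a scene change
        if previous is not None and (features @ previous.T).item() < 0.75:
            boundaries.append(frame_index)
        previous = features
    frame_index += 1

print("candidate scene boundaries at frames:", boundaries)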

Conclusion

Let us know if you have seen or used CLIP in other novel and interesting ways! And if you want to try CLIP yourself, try our tutorial or convert an object detection dataset into a classification dataset for use with CLIP.