How to Analyze and Classify Video with CLIP

CLIP, a computer vision model by OpenAI, can be applied to solve a range of video analysis and classification problems. Consider a scenario where you want to archive and enable search on a collection of advertisements. You could use CLIP to classify videos into various categories (e.g., advertisements featuring football or the seaside). You can then use these categories to build a media search engine for advertisements.

In this guide, we are going to show how to analyze and classify video with CLIP. We will take a video with five scenes that is featured on the Roboflow homepage. We will use CLIP to answer three questions about the video:

  1. Does the video contain a construction scene?
  2. If the video contains a construction scene, when does that scene begin?
  3. How long do construction scenes last?

Here is the video with which we will be working:

[Video: the five-scene clip featured on the Roboflow homepage]

The approach we use in this article could be used to solve other media analytics and analysis problems, such as:

  1. Which of a series of categories best describes a video?
  2. Does a video contain a restricted item (e.g., alcohol)?
  3. At what timestamps do specific scenes take place?
  4. How long is an item on screen?

Without further ado, let’s get started!

How to Classify Video with CLIP

To answer the questions we posed earlier – does the video contain a construction scene, when does that scene begin, and how long does it last – we will follow these steps:

  1. Install the required dependencies
  2. Split up a video into frames
  3. Run CLIP to categorize a limited set of frames

Step #1: Install Required Dependencies

We are going to use CLIP with the Roboflow Inference Server. The Inference Server provides a web API through which you can query Roboflow models as well as foundation models such as CLIP. We're going to use the hosted Inference Server, so we don't need to install it.

We need to install the Python packages our script uses: requests and Pillow, with which we will call the API and encode frames, and supervision, with which we will work with video:

pip install requests pillow supervision

Now that we have the required dependencies installed, we can start classifying our video.

Step #2: Write Code to Use CLIP

To start our script to analyze and classify video, we need to import dependencies and set a few variables that we will use throughout our script.

Create a new Python file and add the following code:

import base64
from io import BytesIO

import requests
import supervision as sv
from PIL import Image

INFERENCE_ENDPOINT = "https://infer.roboflow.com"
API_KEY = "API_KEY"
VIDEO = "./video.mov"

# The categories into which each frame will be classified
prompts = [
    "construction site",
    "something else"
]

# The category for which we want to compute analytics
ACTIVE_PROMPT = "construction site"

Replace the following values above as required:

  • API_KEY: Your Roboflow API key. Learn how to retrieve your Roboflow API key.
  • VIDEO: The name of the video to analyze and classify.
  • prompts: A list of categories into which each video frame should be classified.
  • ACTIVE_PROMPT: The prompt for which you want to compute analytics. We use this later in the script to report whether the video contains the active prompt, and when the scene featuring the active prompt first starts.

In this example, we are searching for scenes that contain a construction site. We have provided two prompts: "construction site" and "something else".
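You can add as many categories as you need. For instance, for the advertisement search engine described earlier, the prompt list might look like this (the extra labels are illustrative):

# Illustrative only: add as many candidate labels as you need
prompts = [
    "construction site",
    "football",
    "seaside",
    "something else"
]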

Next, we need to define a function that can run inference on each frame in our video:

def classify_image(image) -> str:
    # Encode the video frame (a numpy array) as a base64 JPEG
    image_data = Image.fromarray(image)

    buffer = BytesIO()
    image_data.save(buffer, format="JPEG")
    image_data = base64.b64encode(buffer.getvalue()).decode("utf-8")

    payload = {
        "api_key": API_KEY,
        "subject": {
            "type": "base64",
            "value": image_data
        },
        "prompt": prompts,
    }

    data = requests.post(INFERENCE_ENDPOINT + "/clip/compare", json=payload)

    response = data.json()

    # Find the prompt with the highest similarity score
    highest_prediction = 0
    highest_prediction_index = 0

    for i, prediction in enumerate(response["similarity"]):
        if prediction > highest_prediction:
            highest_prediction = prediction
            highest_prediction_index = i

    return prompts[highest_prediction_index]

This function takes a video frame, runs inference using CLIP via the Roboflow Inference Server, then returns the prompt that best matches the frame.
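Before classifying the whole video, you can sanity-check the function on a single frame. This quick check is our own addition, not part of the main script:

# Quick sanity check (our own addition): classify the first frame only
first_frame = next(sv.get_video_frames_generator(source_path=VIDEO))
print(classify_image(first_frame))  # e.g. "construction site"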

Finally, we need to call this function on frames in our video. To do so, we will use supervision to split up our video into frames. We will then run CLIP on each frame:

STRIDE = 10  # classify one frame out of every 10
FPS = 24     # our video runs at 24 frames per second

results = []

for i, frame in enumerate(sv.get_video_frames_generator(source_path=VIDEO, stride=STRIDE)):
    print("Frame", i)
    label = classify_image(frame)

    results.append(label)

print(f"Does this video contain a {ACTIVE_PROMPT}?", "yes" if ACTIVE_PROMPT in results else "no")

if ACTIVE_PROMPT in results:
    print(f"When does the {ACTIVE_PROMPT} first appear?", round(results.index(ACTIVE_PROMPT) * STRIDE / FPS, 0), "seconds")

print(f"For how long is the {ACTIVE_PROMPT} visible?", round(results.count(ACTIVE_PROMPT) * STRIDE / FPS, 0), "seconds")

This code sets a stride value of 10, which means one frame is classified for every 10 frames in the video. For faster results, set a higher stride value; for more precise results, set a lower one. With a stride of 10 and a 24 FPS video, roughly 2.4 frames are classified per second of footage.
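To make the timestamp arithmetic in the print statements concrete, here is the same index-to-seconds conversion pulled out into a helper function (the helper is ours, for illustration):

# Convert a position in `results` to an approximate timestamp in seconds.
# With a stride of 10 and a 24 FPS video, results[17] was sampled at
# 17 * 10 / 24 ≈ 7.1 seconds into the video.
def index_to_seconds(index: int, stride: int = 10, fps: int = 24) -> float:
    return index * stride / fps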

After running CLIP on the video, the code reports:

  1. Whether the video contains a construction site;
  2. When the construction scene begins; and
  3. How long the construction scene lasts.

Let’s run our code and see what happens:

Does this video contain a construction site? yes
When does the construction site first appear? 7 seconds
For how long is the construction site visible? 6 seconds

Our code has successfully identified that our video contains a construction scene, identified the time at which the scene starts, and measured the duration of the scene. CLIP did, however, classify the shipyard scene as construction.

This is why the "construction site visible" metric is six seconds instead of the ~3 seconds for which the actual construction site is visible. CLIP is likely interpreting the moving heavy vehicles and the general environment of the shipyard as construction, although no construction is going on.
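If you want to see exactly where predictions drift like this, one option is to group consecutive labels in results into runs, each with an approximate start and end time. The sketch below is our own illustration of this idea, not part of the main script:

from itertools import groupby

# Group consecutive identical labels into (label, start_s, end_s) runs so
# you can inspect each predicted scene, including misclassified stretches.
def label_runs(labels, stride=10, fps=24):
    runs = []
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        start = index * stride / fps
        end = (index + length) * stride / fps
        runs.append((label, round(start, 1), round(end, 1)))
        index += length
    return runs

print(label_runs(results))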

CLIP is not perfect: the model may not pick up on what is obvious to humans. If CLIP doesn't perform well for your use case, it is worth exploring how to create a purpose-built classification model for your project. You can use Roboflow to train custom classification models.

Note: The timestamps returned are not fully precise because we have set a stride value of 10. For more precise timestamps, set a lower stride value. Lower stride values will run inference on more frames, so inference will take longer.

Conclusion

CLIP is a versatile tool with many applications in video analysis and classification. In this guide, we showed how to use the Roboflow Inference Server to classify video with CLIP. We used CLIP to find whether a video contains a particular scene, when that scene begins, and how long that scene lasts.

If you need to identify specific objects in an image – company logos, specific products, defects – you will need to use an object detection model instead of CLIP. We are preparing a guide on this topic that we will release in the coming weeks.