You can use the multimodal model CLIP to classify frames in a video. This is useful for media indexing use cases where you want to assign labels to videos. You could assign a single label to a video (e.g. whether or not a video contains a violent scene) or multiple labels (e.g. a video contains an office scene, a park scene, and more).

In this guide, we are going to classify the frames in a video for use in data analytics using CLIP, an open source multimodal model by OpenAI, and the Gaudi2 system. Gaudi2 is developed by Habana, an Intel company. We are going to use this system to evaluate whether a video contains various scene descriptors. We will search for a park scene and an office scene in a video.

The Gaudi2 system is designed for large-scale applications, offering 96 GB of HBM2E memory and dual matrix multiplication engines in each chip. You could use the guidance in this tutorial to scale up to processing thousands of videos.

Without further ado, let’s get started!

What is CLIP?

Contrastive Language Image Pre-training (CLIP) is a multimodal computer vision model developed by OpenAI. With CLIP, you can compare the similarity of two images, or the similarity of an image to a series of text labels. You can use the latter functionality to classify video frames.
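
For example, here is a minimal sketch of how you could compare a single image to a handful of text labels with the Transformers implementation of CLIP. The image path and labels below are placeholders:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["an office", "a park", "something else"]
image = Image.open("frame.jpg")  # placeholder image

# Score the image against each label and pick the best match
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

print(labels[probs.argmax().item()])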

To use CLIP for video classification, we will:

  1. Install CLIP.
  2. Calculate CLIP vectors for every frame in a video.
  3. Identify the most similar class to each frame.
  4. Assign tags to timestamps in the video.

Step #1: Install Dependencies

For this tutorial, we are going to use the Transformers implementation of CLIP. With it, we can specify that we want to run CLIP on our Gaudi2 chip, which is optimized for machine learning workloads.

To install Transformers, run the following command:

pip install transformers

To process our video, we are going to use the supervision Python package. This package contains a range of utilities for use in building computer vision applications. We will use the supervision video functionalities to divide a video into frames. We will then classify each frame with CLIP.

To install supervision, run:

pip install supervision
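
As a quick check that supervision can read your footage, you can print the video metadata and pull a single frame. The file name below is a placeholder for your own video:

import supervision as sv

video_info = sv.VideoInfo.from_video_path("trailer.mp4")  # placeholder path
print(video_info.fps, video_info.total_frames)

# Grab the first frame from the video
frame = next(sv.get_video_frames_generator(source_path="trailer.mp4"))
print(frame.shape)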

We are now ready to start writing the logic for our application.

Step #2: Calculate CLIP Vectors for Video Frames

Let’s classify the trailer of the movie Contact. This trailer features scenes that include computers, offices, satellites, and more.

Before we can classify each frame, we need to compute a CLIP vector for it. Once we have these embeddings, we can compare the text embeddings of our labels (e.g. “computer”, “park”, “satellite”) to each frame embedding to identify which label best represents the frame.

Let's start by importing the required dependencies and initializing a few variables we will use in our script:

try:
    import habana_frameworks.torch.core as htcore

    DEVICE = "hpu"
except ImportError:
    DEVICE = "cpu"

from transformers import CLIPProcessor, CLIPModel
import supervision as sv
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(DEVICE)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

VIDEO = "trailer.mp4"

PROMPTS = ["satellite", "office", "park", "computer", "outdoors", "meetings", "something else"]
INTERVAL_PERIOD = 5
results = []

In the code above, we:

  1. Import the required dependencies
  2. Load the CLIP model and cast it to our HPU chip
  3. Declare variables we will use throughout our script

In the script, replace:

  1. VIDEO with the name of the video you want to analyze.
  2. PROMPTS with the prompts you want to use in video classification.
  3. INTERVAL_PERIOD with the interval (in seconds) you want to use to analyze your video by timestamp. A value of 5 means that, later in this guide, we will generate a report showing the most common prompt for every 5 seconds of the video.

CLIP works with an open vocabulary. This means there is no “master list” of prompts that you can specify. Rather, you can specify any label you want. With that said, we recommend testing different labels to see which ones are most effective for your use case.

In this guide, we use the prompts:

  • satellite
  • office
  • park
  • computer
  • outdoors
  • meetings
  • something else

“something else” is an effective null class. If none of the specified labels match, “something else” is more likely to match.

Next, we need to declare a few functions to help calculate embeddings. We will use embeddings to analyze the contents of our video.

def get_image_embedding(image):
    inputs = processor(images=[image], return_tensors="pt", padding=True).to(DEVICE)

    outputs = model.get_image_features(**inputs)

    return outputs.cpu().detach().numpy()

def get_text_embedding(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True).to(DEVICE)

    outputs = model.get_text_features(**inputs)

    return outputs.cpu().detach().numpy()

def classify_image(image, prompts):
    image_embedding = get_image_embedding(image)

    sims = []

    for prompt in prompts:
        prompt_embedding = PROMPT_EMBEDDINGS[prompt]

        sim = cosine_similarity(image_embedding, prompt_embedding)
        sims.append(sim)

    return prompts[np.argmax(sims)]

In the code above, we declare three functions: one to calculate image embeddings, one to calculate text embeddings, and one that uses all our text embeddings and a single image embedding to return a single label for a frame.
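
If you want to sanity check these functions before processing the whole video, you could run them on a single frame. Below is a minimal sketch that uses the variables we defined earlier; it computes the prompt embeddings up front, just as the full script in the next step does:

# Embed each prompt once so classify_image can look the vectors up
PROMPT_EMBEDDINGS = {prompt: get_text_embedding(prompt) for prompt in PROMPTS}

# Grab the first frame of the video and classify it
first_frame = next(sv.get_video_frames_generator(source_path=VIDEO))
print(classify_image(first_frame, PROMPTS))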

Now, let's analyze our video!

Video Analysis with CLIP

For each frame in our video, we want to find the most relevant label. We can do this using the following algorithm:

  1. Calculate text embeddings for all our prompts and save them for later use.
  2. Take a video frame.
  3. Calculate an image embedding for the video frame.
  4. Find the text embedding that is most similar to the video frame.
  5. Use the label associated with that text embedding as a label for the frame.

We can repeat this process for each frame in our video to classify video frames.

Add the following code to the Python script we started in the last step:

PROMPT_EMBEDDINGS = {prompt: get_text_embedding(prompt) for prompt in PROMPTS}

for i, frame in enumerate(sv.get_video_frames_generator(source_path=VIDEO, stride=1)):
    result = classify_image(frame, PROMPTS)

    results.append(result)
    
video_info = sv.VideoInfo.from_video_path(VIDEO)

video_length = len(results) / video_info.fps
video_length = round(video_length, 2)

print(f"The video is {video_length} seconds long")

timestamps = {}

for i, result in enumerate(results):
    closest_interval = int(i / video_info.fps / INTERVAL_PERIOD) * INTERVAL_PERIOD

    if closest_interval not in timestamps:
        timestamps[closest_interval] = [result]
    else:
        timestamps[closest_interval].append(result)

for key, value in timestamps.items():
    most_common = max(set(value), key=value.count)

    print(f"From {key} to {key + INTERVAL_PERIOD} seconds, the main category is {most_common}")

In this code, we open our video file and, for each frame, calculate the most relevant label. We then calculate how long our video is. Finally, we group labels by interval (the INTERVAL_PERIOD value we set earlier).

For each interval (e.g. 0-5s, 5-10s, 10-15s), we find the most common label. We then assign that as a label for that timestamp.

Let's run our script on the Contact trailer. Here is an excerpt of the results:

From 0 to 5 seconds, the main category is satellite
From 0 to 10 seconds, the main category is satellite
From 5 to 15 seconds, the main category is satellite
From 10 to 20 seconds, the main category is satellite
From 15 to 25 seconds, the main category is satellite
From 20 to 30 seconds, the main category is satellite
From 25 to 35 seconds, the main category is satellite
...
From 1970 to 1980 seconds, the main category is satellite
From 1975 to 1985 seconds, the main category is something else
From 1980 to 1990 seconds, the main category is something else
From 1985 to 1995 seconds, the main category is computer
From 1990 to 2000 seconds, the main category is computer
From 1995 to 2005 seconds, the main category is computer
...

Our script has successfully assigned labels to different timestamps in our video.

We can process the timestamps further to understand what percentage of a video matches each prompt. To do so, we can use this code:

percentage_of_video_prompt_coverage = {prompt: 0 for prompt in PROMPTS}

for prompt in PROMPTS:
    counter = results.count(prompt)

    percentage_of_video_prompt_coverage[prompt] = counter / len(results)

for prompt, percentage in percentage_of_video_prompt_coverage.items():
    print(f"The prompt {prompt} is present in {round(percentage * 100, 2)}% of the video")

When we analyze our video with this code, we get the following breakdown:

The prompt satellite is present in 30.31% of the video
The prompt office is present in 3.45% of the video
The prompt park is present in 0.0% of the video
The prompt computer is present in 39.93% of the video
The prompt outdoors is present in 4.47% of the video
The prompt meetings is present in 0.67% of the video
The prompt something else is present in 21.16% of the video

With this logic, we can make determinations about our video.

For example, if more than 10% of a video matches “computer”, we could classify the trailer with a label like “Technology” in an internal media database. In a broadcasting scenario, we might hold back any trailer that contains violence until after a certain time of day, which is critical for complying with “watershed” rules under which violent material cannot be broadcast earlier in the day.
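
Here is a minimal sketch of that kind of rule, using the percentage_of_video_prompt_coverage dictionary computed above. The tag names and thresholds are made up for illustration:

# Map prompts to (tag, minimum share of the video) rules; values are illustrative
TAG_RULES = {
    "computer": ("Technology", 0.10),
    "office": ("Workplace", 0.10),
}

video_tags = [
    tag
    for prompt, (tag, threshold) in TAG_RULES.items()
    if percentage_of_video_prompt_coverage.get(prompt, 0) >= threshold
]

print(video_tags)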

Using CLIP and Gaudi2 for Video Classification

You can use CLIP to identify the most relevant label for each frame in a video, and use that logic to classify videos both as a whole and by timestamp. Classifying by timestamp is useful for search: for example, you could build a search engine that lets you look through a specific video for a scene that matches a label like “computer”.
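
As a rough sketch of that idea, you could look up which intervals in the timestamps dictionary we built earlier are dominated by a given label:

def search_video(timestamps, query):
    # Return the start (in seconds) of every interval whose most common label matches the query
    return [
        start
        for start, labels in timestamps.items()
        if max(set(labels), key=labels.count) == query
    ]

print(search_video(timestamps, "computer"))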

To compute CLIP vectors for our application, we used a Gaudi2 chip. Gaudi2 is designed for high-performance AI workloads. The Transformers implementation of CLIP we used in this guide is optimized for use on Gaudi2.

You could use the logic we wrote in this guide to index a large repository of videos in a batch. Or you could build a queue that classifies videos as they are submitted to a system.

To learn more about the Gaudi2 system, refer to the Gaudi2 product reference on the Intel Habana website.