Imagine a search engine that lets you find frames in a video given a text prompt. For example, a search engine that helps you find all of the title credits in a trailer, all of the adverts in a broadcast, or all of the scenes that are in an office. With Roboflow’s Video Inference API, it is possible to build such a search engine.

In this guide, you will learn how to use the Roboflow Video Inference API to calculate CLIP vectors for frames in a video. We will then use these vectors to build a search engine that accepts text queries and returns timestamps for relevant frames.

By the end of this guide, we will build a video search engine that looks like this:

[Embedded video: a 13-second demo of the finished video search engine]

Without further ado, let’s get started.

How to Build a Video Search Engine

We are going to run CLIP on frames in a video. CLIP is a multimodal model that you can use to calculate embeddings. These embeddings contain semantic information about the contents of an image.

You can compare embeddings from video frames to text embeddings calculated by CLIP. The more similar a frame embedding is to the text embedding, the more relevant that frame is to the text.
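As a rough illustration, here is a minimal sketch of comparing one frame embedding to one text embedding with cosine similarity. The vectors below are random placeholders; in practice, both come from CLIP.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Higher values mean the frame and the text are more semantically similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 512-dimensional vectors; real values come from CLIP.
frame_embedding = np.random.rand(512)
text_embedding = np.random.rand(512)

print(cosine_similarity(frame_embedding, text_embedding))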

To build our search engine, we need to:

  1. Calculate CLIP vectors for frames in a video; and
  2. Create a system that lets you compare each frame vector with an embedding associated with a text query.

Let’s talk through each of these steps.

Step #1: Calculate CLIP Vectors for Video Frames

To build a video search engine, we need to calculate CLIP vectors for frames in a video. We will use these vectors to find frames related to a text query. This leaves a question: how can you calculate CLIP vectors?

For that, we are going to use the Roboflow Video Inference API, which lets you run computer vision models on frames in a video, from fine-tuned object detection models trained on Roboflow to foundation models such as CLIP and gaze detection.

All CLIP embedding calculations run on our hosted infrastructure. All you need to do is upload a video and we will return the frame numbers on which inference was run and the associated CLIP vectors.

Let’s start calculating CLIP vectors. First, download the Roboflow Python package:

pip install roboflow

Next, create a free Roboflow account. Go to the Roboflow dashboard and click “Settings” then “Roboflow API key” to retrieve your API key. Learn how to retrieve your API key.

Create a new Python file and add the following code:

import json

from roboflow import CLIPModel

# Authenticate with your Roboflow API key
model = CLIPModel(api_key="API_KEY")

# Upload the video and start a CLIP video inference job at 3 FPS
job_id, signed_url, expire_time = model.predict_video(
    "trailer.mp4",
    fps=3,
    prediction_type="batch-video",
)

# Wait for the job to finish and retrieve the results
results = model.poll_until_video_results(job_id)

# Save the CLIP vectors to a file for use in our search engine
with open("results.json", "w") as f:
    json.dump(results, f)

Above, replace API_KEY with your Roboflow API key.

You will need a video on which to run inference. In this guide, we will use a trailer from a movie; you can use any media with which you are working.

Replace trailer.mp4 with the name of the video on which you want to run inference.

Above, we have set the inference FPS to 3. This means that we will calculate CLIP vectors for three frames in every second of video. The higher the FPS, the more precise the timestamps in your search engine will be. Using a higher FPS also involves more computation, which will increase the cost associated with calculating CLIP vectors. We recommend 3-5 FPS for most applications.
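The FPS you choose also determines how the frames returned by the API map back to timestamps in the original video. Here is a minimal sketch, assuming the sampled frames are evenly spaced and indexed from zero:

INFERENCE_FPS = 3

# The i-th sampled frame corresponds to roughly i / INFERENCE_FPS seconds
# into the video, so at 3 FPS timestamps land every ~0.33 seconds.
def frame_index_to_timestamp(index: int, fps: int = INFERENCE_FPS) -> float:
    return index / fps

print(frame_index_to_timestamp(10))  # ~3.33 seconds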

When you run the script above, your video will be uploaded to a signed URL, and a video inference job will start with an associated job ID. You can use the signed URL in future requests for a limited period of time, described in the expire_time variable; your video will be deleted after this time. The video inference API does not support long-term media storage.

Our code uses the poll_until_video_results() function, which polls the Roboflow video inference API to check whether inference results are available. If results are available, we save them to a file; otherwise, the function continues to poll the API every 30 seconds.

To check the API manually, you can use:

results = model.poll_for_video_results(job_id)

Where job_id is the ID of your video inference job.
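If you prefer to manage polling yourself, you could wrap this call in your own loop. The sketch below is illustrative and assumes poll_for_video_results() returns an empty (falsy) value until results are ready; check the roboflow package documentation for the exact behavior.

import time

# model and job_id come from the script in Step #1.
# Assumption: an empty/falsy response means the job is still running.
results = None
while not results:
    results = model.poll_for_video_results(job_id)
    if not results:
        time.sleep(30)  # wait before checking again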

Step #2: Set Up the Roboflow Video Search Template

The Roboflow video search template contains a Flask web server that you can use to search CLIP vectors returned by the video inference API.

This web server has one route: a page that accepts a text query and returns the timestamps of the frames most related to that query.

You can use our template as a starting point for video search use cases. We have written all the code you need to compare CLIP vectors and calculate the timestamps associated with frames. You can copy the code from our web server example for use in your own logic.

To download the template, run the following command:

git clone https://github.com/roboflow/templates

Next, navigate to the video search example directory and install the required dependencies:

cd templates/video-search
pip install -r requirements.txt

Move the results.json file that was calculated in the last step to the video-search directory:

mv ~/path/to/results.json .

Now we are ready to start the video search engine:

python3 app.py

The search engine will be available at `http://localhost:5000`. Let’s run a few queries that show our search engine in action:

[Embedded video: example search queries and the frames they return]

Our search engine successfully returns frames related to text queries.

When we make a search query, our application:

  1. Calculates a CLIP vector associated with the text query.
  2. Compares the text CLIP vector to all of the frame CLIP vectors that were calculated by the Roboflow video inference API. We use cosine similarity to measure how similar the text vector is to each frame vector, with a threshold of 0.1. You can customize this value by changing the `THRESHOLD` value in the `app.py` script (see the sketch after this list).
  3. Calculates the timestamps associated with the frames whose CLIP vectors are similar to the text query CLIP vector.
  4. Returns the timestamps in the application front-end.
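Below is a minimal sketch of that comparison logic. It is not the template's actual code: the results.json key name (clip_embeddings), the use of an open-source transformers CLIP model to embed the text query, and the frame-to-timestamp conversion are assumptions for illustration; `app.py` implements this flow with its own details.

import json

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

INFERENCE_FPS = 3   # must match the fps used in Step #1
THRESHOLD = 0.1     # minimum cosine similarity for a frame to count as a match

# Embed the text query with an open-source CLIP model (the template may use
# the hosted Roboflow CLIP endpoint instead).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(query: str) -> np.ndarray:
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = clip.get_text_features(**inputs)
    return features[0].numpy()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical structure: a list of per-frame CLIP vectors saved in Step #1.
with open("results.json") as f:
    results = json.load(f)

frame_vectors = results["clip_embeddings"]  # hypothetical key name

query_vector = embed_text("an office scene")

# Keep the timestamps of frames whose similarity exceeds the threshold.
matching_timestamps = [
    index / INFERENCE_FPS
    for index, vector in enumerate(frame_vectors)
    if cosine_similarity(query_vector, np.array(vector)) > THRESHOLD
]

print(matching_timestamps)  # timestamps, in seconds, of matching frames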

Conclusion

The Roboflow Video Inference API enables you to run computer vision models on frames in videos. You can run inference at a custom FPS.

In this guide, we showed how to run CLIP on frames in a video using the video inference API. You can also run custom models that were trained on Roboflow. We then used these vectors with a video search template to enable searching our video with text queries.

You can use this technology and methodology to create video understanding applications and features such as contextual advertising, video recommendations, media analytics, brand safety, and content moderation.