Computer vision is one of many tools you can use in content moderation. With computer vision, you can automatically find categories of content in a video, as opposed to using manual human effort. 

For example, consider a scenario where you are a video producer and want to know if a video clip contains alcohol. You could use computer vision to classify whether the video contains alcohol. Using this information, you can trigger custom business logic, such as prohibiting the content from being shown before a certain time of day.

In this guide, we are going to demonstrate how to moderate video content with the Roboflow Video Inference API. We will use the CLIP model to identify specific types of scenes (e.g. scenes containing violence or alcohol). By the end of this guide, you will be able to take a video and identify whether it contains specific categories of content.

We'll run our analysis on this video:

Without further ado, let’s get started.

What is CLIP?

Contrastive Language Image Pre-training (CLIP) is an open source computer vision model developed by OpenAI. You can use CLIP to calculate the similarity between two images and the similarity between images and text. With this capability, you can identify frames in a video that are similar to a text prompt, create media search engines that let you find images using text queries, cluster images, and more.
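CLIP comparisons boil down to cosine similarity between embedding vectors: the closer two vectors point in the same direction, the more similar their content. As a minimal, self-contained sketch of the idea (using toy three-dimensional vectors in place of real CLIP embeddings, which have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the
    # vector magnitudes; 1.0 means the vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for an image embedding and two text embeddings.
image_vec = np.array([0.9, 0.1, 0.2])
text_violence = np.array([0.8, 0.2, 0.1])
text_other = np.array([0.1, 0.9, 0.4])

print(cosine_similarity(image_vec, text_violence))  # higher: vectors are close
print(cosine_similarity(image_vec, text_other))     # lower: vectors diverge
```

Real CLIP embeddings are produced by the model's image and text encoders, but the comparison step works exactly like this.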

You can run CLIP on frames in a video using the Roboflow Video Inference API, which provides a flexible, hosted solution for using CLIP with videos. Our inference API will scale up with you, whether you are processing a single video or multiple terabytes per day.

Moderate Content with CLIP and Roboflow

We can compare frames in a video to a list of descriptions that match what we want to identify. For example, consider a scenario where we want to identify scenes that contain violence. We could accomplish this with CLIP using the following prompts:

  • Violence
  • Something else

You can set any arbitrary text prompt(s).

The second prompt, “something else”, is the category we want CLIP to return when no moderation prompt matches. We could add other prompts, too. For example, if you wanted to identify explicit scenes in a video, you could set a prompt for explicit imagery. You can search for multiple categories at one time.

We don’t need these prompts for the video inference API, but we will need them later when we process CLIP results from the video inference API.

In this guide, we will work with a video that contains one violent scene.

Step #1: Install the Roboflow pip Package

The Roboflow Python SDK lets you run inference on videos in a few lines of code. To install the SDK, run the following command:

pip install roboflow

Step #2: Calculate CLIP Vectors

We are going to use the hosted Roboflow Video Inference API to calculate CLIP vectors for frames in a video. Create a new Python file and add the following code:

import json

from roboflow import CLIPModel

model = CLIPModel(api_key="API_KEY")

job_id, signed_url, expire_time = model.predict_video(
    "trailer.mp4",
    fps=3,
    prediction_type="batch-video",
)

results = model.poll_until_video_results(job_id)

with open("results.json", "w") as f:
    json.dump(results, f)

Above, replace:

  1. API_KEY with your Roboflow API key. Learn how to retrieve your Roboflow API key.
  2. trailer.mp4 with the name of the video on which you want to run inference. You can also provide a URL that points to a video.
  3. fps=3 with the frames per second to use in inference. FPS = 3 means that inference will be run three times for every second of video. The higher the FPS value, the more frames on which inference will be run. To learn more about pricing for video inference, refer to the Roboflow pricing page.
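As a rough illustration of how the FPS setting affects the amount of inference performed (the two-minute duration below is a hypothetical example, not the actual length of our trailer):

```python
# Approximate number of frames sent for inference:
# video duration (seconds) x inference FPS.
duration_seconds = 120  # hypothetical two-minute video
inference_fps = 3       # the fps=3 value from the script above

total_inferences = duration_seconds * inference_fps
print(total_inferences)  # 360 frames processed
```

Doubling the FPS doubles the number of frames processed, so pick the lowest FPS that still catches the scenes you care about.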

The script above will start a video inference job on the Roboflow cloud. The poll_until_video_results function will poll the Roboflow API every 60 seconds to check for results. When results are available, the results are saved to a file.

The file contains:

  1. The frames on which inference was run, in `frame_offset`.
  2. The timestamps that correspond with the frames on which inference was run.
  3. The CLIP vectors from inference.
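To get a feel for this structure, here is a toy stand-in for the saved file (a hedged sketch: the key names mirror the fields described above, but the exact shape of the real response may differ, and real CLIP vectors have hundreds of dimensions, truncated here for brevity):

```python
# A toy stand-in for the contents of results.json.
results = {
    "frame_offset": [0, 20, 40],          # frames on which inference ran
    "time_offset": [0.0, 0.66, 1.33],     # corresponding timestamps (seconds)
    "clip": [[0.12, -0.03], [0.08, 0.11], [-0.02, 0.07]],  # CLIP vectors
}

# Each CLIP vector lines up with one frame offset and one timestamp.
for offset, ts, vector in zip(
    results["frame_offset"], results["time_offset"], results["clip"]
):
    print(f"frame {offset} at {ts:.2f}s -> vector of length {len(vector)}")
```

The parallel lists make it easy to map any CLIP vector back to the exact moment in the video it came from.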

Step #3: Compare Moderation Labels with CLIP Vectors

The Roboflow Video Inference API returns raw CLIP vectors. This is because there are many different tasks you can accomplish with CLIP vectors. For this guide, we will focus on using CLIP to identify if a video contains a violent scene.

Create a new Python file and add the following code:

import json

import numpy as np
import torch
import clip
from sklearn.metrics.pairwise import cosine_similarity

with open("results.json", "r") as f:
    results = json.load(f)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

prompts = ["violence", "something else"]

text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)

    prompts_to_features = list(zip(prompts, text_features))

    buffer = []

    for result in results["clip"]:
        results_for_frame = {}

        for prompt, embedding in prompts_to_features:
            results_for_frame[prompt] = cosine_similarity(
                embedding.cpu().numpy().reshape(1, -1), np.array(result).reshape(1, -1)
            )[0][0]

        buffer.append(max(results_for_frame, key=results_for_frame.get))

    # if five frames in a row are classified as "violence", we have a match
    match = False

    for i in range(len(buffer) - 4):
        if buffer[i : i + 5] == ["violence"] * 5:
            match = True
            break

print("Match for 'violence':", match)

In this code, we compute CLIP vectors for our two prompts: “violence” and “something else”. We then calculate how similar each frame CLIP vector is to the prompts.

If the prompt “violence” is the closest match for five consecutive frame vectors, we stop iterating over the video and record that the video contains violence. Requiring a run of consecutive matches ensures that one or two spurious frame-level matches do not cause a video to be classified as violent.
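The consecutive-match rule can be factored into a small standalone helper (a sketch; `has_consecutive_run` is an illustrative name, not part of the script above):

```python
def has_consecutive_run(labels, target, run_length):
    """Return True if `target` appears at least `run_length` times in a row."""
    streak = 0
    for label in labels:
        # Extend the streak on a match; reset it otherwise.
        streak = streak + 1 if label == target else 0
        if streak >= run_length:
            return True
    return False

buffer = ["something else"] + ["violence"] * 5
print(has_consecutive_run(buffer, "violence", 5))  # True
```

Tuning `run_length` trades off sensitivity against robustness: a longer run requirement tolerates more one-off false positives but may miss very short scenes.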

Here is the result from our script:

Match for 'violence': True

Our video – a movie trailer – contains violent scenes, and our script has successfully identified them.

With the code above, you can make determinations based on your business logic. For example, you may decide that videos that contain violence (i.e. that depict violent scenes) need to go to a human reviewer for further review. Or, if you are running a community, you may reject content that contains violent scenes.
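For example, the moderation result can feed a simple routing decision (a hedged sketch; the function name, labels, and actions are placeholders for your own business rules):

```python
def route_content(contains_violence: bool, community_mode: bool = False) -> str:
    # Placeholder business rules: reject outright in a community setting,
    # otherwise escalate flagged videos to a human reviewer.
    if contains_violence:
        return "reject" if community_mode else "human_review"
    return "approve"

print(route_content(True))                       # human_review
print(route_content(True, community_mode=True))  # reject
print(route_content(False))                      # approve
```

Keeping the routing logic separate from the CLIP analysis makes it easy to change moderation policy without re-running inference.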


Conclusion

You can use CLIP with the Roboflow Video Inference API to identify whether a video contains scenes that are not appropriate for your platform. For example, you can identify scenes that contain violence, explicit imagery, or other types of content.

In this guide, we walked through how to run CLIP on frames in a video with the video inference API. We then wrote code that compares moderation labels to the CLIP vectors associated with video frames. This code allows you to classify whether a video contains content that you want to moderate.