Computer vision applications regularly deal with recorded (vs. realtime) video. For example, you may want to analyze a large catalog of video for use in media indexing; process drone footage after a flight; conduct content moderation of videos uploaded to your service; analyze media from an online exam to check for cheating; process CCTV footage; and much more.

Computer vision and multimodal models can be applied to video understanding, but processing large volumes of video, especially asynchronously, requires dedicated infrastructure capable of handling terabytes of data.

We are excited to introduce the Roboflow Video Inference API, which allows you to apply computer vision models to videos in the cloud. With the Video Inference API, you can use the video infrastructure we've relied upon internally, which handles batching requests, concurrent processing, and the complexity of scaling inference across the frames in a video.

You upload a video (or provide a URL to a video) and Roboflow will efficiently run your custom model (or one of a number of foundation models for common tasks like OCR and tagging common objects) across the video, returning the predictions as structured JSON.

Here is an example of a semantic search application built on top of the Roboflow Video Inference API:


Computer vision and multimodal powered video understanding enables applications such as media search, content moderation, video metadata enrichment, and video analysis. Combining video and AI unlocks an entirely new realm of possibilities for anyone using video in their business.

In this guide, you will learn how the Video Inference API works and how to run a fine-tuned computer vision model on a video. We will then show you how to render predictions returned by the model directly onto the video using Supervision, an open source Python package powering hundreds of vision applications. 

Without further ado, let’s get started!

What is the Roboflow Video Inference API?

The Roboflow Video Inference API is a hosted solution that enables you to run any Roboflow Inference model for video processing. You can run custom object detection, segmentation, and classification models for identifying objects of interest specific to your domain.

You can also use open source foundation models, such as CLIP (OpenAI's multimodal zero-shot classification model) and gaze detection, to solve different vision problems.

The Video Inference API is designed to scale with any use case, whether you are working with a small collection of videos, an archive of years of material, or videos that come in every day. You can use signed-URL endpoints to upload videos from S3, GCS, or Azure Blob Storage with confidence. All this makes adding video inference a secure and frictionless experience.

For example, you could use video inference to:

  1. Identify scenes for dynamic ad insertion;
  2. Generate tags for use in building media search engines;
  3. Analyze videos for brand safety requirements;
  4. Scan security footage to detect objects of interest;
  5. Blur or obscure people or sensitive information;
  6. Highlight objects of interest;
  7. Count objects crossing a threshold over time;
  8. And more.

How to Analyze a Video with Roboflow

Suppose we have a video of a football game and we want to track player movement throughout the game. We can use computer vision to identify the position of every player in each frame of the video. We can then apply post-processing to visualize predictions from the model and add business logic.

We will use the following football video:


We can process this video using the Video Inference API. This API, available in the Roboflow Python package, accepts an uploaded video or a public URL, runs inference on frames at the specified frame rate, and returns a result.

If you don’t already have a model trained on Roboflow, check out our guide to getting started with Roboflow. You can also upload models you have trained elsewhere to Roboflow. Learn how to upload a model to Roboflow.

First, install the Roboflow Python package:

pip install roboflow

Next, create a new Python file and add the following code:

from roboflow import Roboflow

rf = Roboflow(api_key="API_KEY")
project = rf.workspace().project("MODEL_ID")
model = project.version(VERSION).model

job_id, signed_url = model.predict_video(
    "football-video.mp4",
    fps=5,
    prediction_type="batch-video",
)

results = model.poll_until_video_results(job_id)

Above, replace API_KEY with your Roboflow API key, MODEL_ID with your Roboflow model ID, and VERSION with your model version. Learn how to retrieve your Roboflow API key. Learn how to retrieve your model version.

Replace football-video.mp4 with the name of the file on which you want to run inference, or the URL of the video on which you want to run inference. 
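predict_video() accepts either kind of source. As an illustration only (not the library's actual implementation), the distinction between the two input types comes down to a URL scheme check like this:

```python
from urllib.parse import urlparse

def is_remote_source(source: str) -> bool:
    """Return True when the source looks like a hosted video URL."""
    return urlparse(source).scheme in ("http", "https")

# Local files are uploaded before inference; URLs are fetched directly.
print(is_remote_source("football-video.mp4"))                # local file
print(is_remote_source("https://example.com/football.mp4"))  # hosted URL
```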

You can also run multiple models at the same time. For example, you could run CLIP and a custom-trained object detection model. To run multiple models concurrently, use:

job_id, signed_url = model.predict_video(
    "football-video.mp4",
    fps=5,
    prediction_type="batch-video",
    additional_models=["clip"],
)

results = model.poll_until_video_results(job_id)

Supported additional models are:

  • CLIP (clip)
  • Gaze detection (gaze)

Replace `fps` with a value that makes sense for your use case. The higher this number is, the longer inference will take, and thus the more expensive the operation will be. Refer to the Roboflow pricing page to learn more about how much video inference costs.

For most use cases, running inference at 3-5 FPS is appropriate. A 5 FPS value will run inference ~5 times for every second of video, giving you a high degree of fidelity without running inference on every frame. The inference FPS cannot exceed the video FPS.

As a best practice, make sure the video FPS is divisible by the inference FPS; otherwise, the returned frame_offset values will be inaccurate.
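To make the divisibility rule concrete, here is a sketch, assuming the API samples frames at a fixed stride of video FPS divided by inference FPS, showing which frame offsets receive predictions:

```python
def sampled_frame_offsets(total_frames: int, video_fps: int, inference_fps: int) -> list:
    """Frames that would receive predictions, assuming a fixed sampling stride."""
    if video_fps % inference_fps != 0:
        # A non-integer stride is why frame_offset becomes inaccurate.
        raise ValueError("video FPS should be divisible by inference FPS")
    stride = video_fps // inference_fps
    return list(range(0, total_frames, stride))

# One second of a 30 FPS video processed at 5 FPS: predictions every 6th frame.
print(sampled_frame_offsets(30, 30, 5))  # [0, 6, 12, 18, 24]
```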

To check your video FPS, you can use the following command:

ffmpeg -i my_video.mp4 2>&1 | sed -n "s/.*, \(.*\) fp.*/\1/p"

In this guide, we will be plotting predictions on a video, so we need to set the inference FPS equal to the video's FPS to ensure there are no gaps in the demo. For most use cases, running inference on every frame is unnecessary; we don't recommend running at a high FPS unless you need inference results for every frame.

In this code, we import the Roboflow Python package, instantiate a model, and then run inference on a video. The predict_video() function accepts either a URL to a video or a local path. If a file is stored locally, the file will be uploaded to a signed Google Cloud URL which you can use for future requests.

Next, predict_video() enqueues a job on the Roboflow Video Inference API. This job will run inference on your model at the FPS rate specified.

The predict_video() function returns a private, signed URL where your video was uploaded (if you specified a file from your local computer on which to run inference) and a job ID. This job ID can be used to poll the Video Inference API for results.

The model.poll_until_video_results(job_id) function will check for results every 60 seconds and return once they are available. You can also check for results a single time using model.poll_for_video_results(job_id).

When the Video Inference API has finished running inference, the inference results will be available in `results`.

The results will be a JSON object with the following keys:

  1. frame_offset: The frames on which inference was run. If you set FPS = 5 on a 30 FPS video, for example, the frame_offset will go 0, 6, 12, and so on. If you set FPS = 30, the frame_offset will be 0, 1, 2, and so on.
  2. MODEL_ID: Where `MODEL_ID` represents the ID of the model(s) applied. This key contains a list of inference results. Each list entry corresponds to the frame number at the same index in the `frame_offset` list.

Here is an example response from the API:

{
  "frame_offset": [0, 6, 12],
  "football": [
    {"predictions": [{"x": 1570.0, "y": 319.5, "width": 700.0, "height": 635.0, "confidence": 0.4201706647872925, "class": "Scissors", "class_id": 2}]},
    ...
  ]
}
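Because frame_offset and the model's result list are parallel, you can pair them with zip. A minimal sketch using a response shaped like the example above (the prediction values here are illustrative, not real output):

```python
# Response shaped like the Video Inference API example; values are illustrative.
response = {
    "frame_offset": [0, 6, 12],
    "football": [
        {"predictions": [{"class": "player", "confidence": 0.92}]},
        {"predictions": []},
        {"predictions": [{"class": "player", "confidence": 0.87}]},
    ],
}

# Pair each frame number with the predictions made on that frame.
for frame, result in zip(response["frame_offset"], response["football"]):
    classes = [p["class"] for p in result["predictions"]]
    print(f"frame {frame}: {classes}")
```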

Now that we have results from our video, we can apply business logic.

As mentioned above, we set the inference FPS to match the video FPS (24 for our video) so that we can plot predictions on the original video. This is because we need predictions for every frame in order to map the predictions to frames in the video.

If we only had predictions for every fifth frame, for example, most frames wouldn’t show predictions; the predictions would flicker and be difficult to interpret. 
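If you do run inference at a lower FPS, a common workaround is to reuse the most recent prediction for the in-between frames. A sketch of that lookup, using bisect to find the latest sampled frame at or before the current one:

```python
import bisect

def latest_result_index(frame_offsets: list, frame_number: int) -> int:
    """Index of the most recent sampled frame at or before frame_number."""
    i = bisect.bisect_right(frame_offsets, frame_number) - 1
    return max(i, 0)

offsets = [0, 6, 12]  # e.g. inference at 5 FPS on a 30 FPS video

# Frames 6 through 11 all reuse the predictions made on frame 6.
print(latest_result_index(offsets, 7))   # 1
print(latest_result_index(offsets, 12))  # 2
```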

We can plot predictions using Roboflow supervision, an open source Python package that provides utilities for building computer vision applications. First, install supervision:

pip install supervision

Then, add the following code to your Python file:

import numpy as np
import supervision as sv

bounding_box_annotator = sv.BoundingBoxAnnotator()
label_annotator = sv.LabelAnnotator()

def callback(scene: np.ndarray, index: int) -> np.ndarray:
    # Look up the predictions for this frame. Replace "football"
    # with the model ID key from your results.
    frame_results = results["football"][index]
    detections = sv.Detections.from_inference(frame_results)

    labels = [
        prediction["class"]
        for prediction in frame_results["predictions"]
    ]

    annotated_image = bounding_box_annotator.annotate(
        scene=scene, detections=detections)
    annotated_image = label_annotator.annotate(
        scene=annotated_image, detections=detections, labels=labels)

    return annotated_image

sv.process_video(
    source_path="football-video.mp4",
    target_path="output.mp4",
    callback=callback,
)


This code will load our football video, iterate over each frame, plot detections from our model, then save the results to a file called output.mp4. Here is the result of our script:


Our script has successfully run inference, waited for predictions, plotted predictions onto the source video, and saved the results to a file.

Conclusion


The Roboflow Video Inference API is a hosted solution for running inference on videos. You can use any fine-tuned model hosted on Roboflow, or supported foundation models (CLIP, gaze detection, etc.).

To learn more about how to use video inference, check out the Video Inference API documentation. To learn about video inference pricing, check out the Roboflow pricing page.