Computer vision models are normally trained to give you predictions on a single image at a time. The input to these models is often individual photos or frames from recorded video, collected asynchronously. But oftentimes you'll want to get predictions from a camera feed in realtime.

The final result.

We previously showed how to use images from your Nest Camera with your Roboflow models in Node.js and in your web browser with Tensorflow.js.

In this post, we'll demonstrate how to pipe your webcam data to your Roboflow Trained model for realtime inference using Python and a USB webcam connected to your computer.

The Final Code


The final code for this demo can be found in our API Snippets Github Repo.


We'll be using a custom-trained Roboflow model for this tutorial so you will need to train one first by following the steps here.


We'll need Python 3.7+ for this demo. Then install

  • opencv-python to connect to your webcam and transform the image data.
  • numpy to convert the pixel data to an array and back.
  • requests to send the image to your model API and retrieve the resulting prediction.
pip3 install opencv-python numpy requests
Install the dependencies

Coding the Demo

We first create a new Python file and initialize it with our Roboflow model info (train a model with Roboflow Train first and obtain your API key from your Roboflow settings) and the packages we just installed.


import cv2
import base64
import numpy as np
import requests

# Your model's info from the Roboflow dashboard (placeholder values shown)
ROBOFLOW_API_KEY = "YOUR_API_KEY"
ROBOFLOW_MODEL = "your-model-name/1"  # model ID and version number
ROBOFLOW_SIZE = 416  # your model's input size

Then we'll construct our model's API endpoint URL. This example uses the Hosted Inference API, but you can also use our on-device deployments by swapping the hostname out for your device's IP address.

upload_url = "".join([

Next we'll open up a connection to our webcam with OpenCV:

video = cv2.VideoCapture(0)

Then we define an infer function that contains the core logic of our program. It performs the following operations each time it is called:

  1. Retrieves the current image from the webcam.
  2. (Optional) Resizes it to our model's input size to save bandwidth and increase speed.
  3. Converts it to a base64-encoded string.
  4. Sends a POST request to our trained model's API endpoint.
  5. Parses the resulting predictions and returns them as an image we can display.
# Infer via the Roboflow Infer API and return the result
def infer():
    # Get the current image from the webcam
    ret, img = video.read()

    # Resize (while maintaining the aspect ratio) to improve speed and save bandwidth
    height, width, channels = img.shape
    scale = ROBOFLOW_SIZE / max(height, width)
    img = cv2.resize(img, (round(scale * width), round(scale * height)))

    # Encode image to base64 string
    retval, buffer = cv2.imencode('.jpg', img)
    img_str = base64.b64encode(buffer)

    # Get prediction from Roboflow Infer API
    resp = requests.post(upload_url, data=img_str, headers={
        "Content-Type": "application/x-www-form-urlencoded"
    }, stream=True).raw

    # Parse result image
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)

    return image
The core logic of our script.

Then we will call this function in a loop and display the current prediction image until we detect that the q key is pressed on the keyboard.

# Main loop; infers sequentially until you press "q"
while 1:
    # On "q" keypress, exit
    if(cv2.waitKey(1) == ord('q')):
        break

    # Synchronously get a prediction from the Roboflow Infer API
    image = infer()
    # And display the inference results
    cv2.imshow('image', image)

And finally, after the loop is broken by the q key we will release the webcam resource and clean up our visualization resources.

# Release resources when finished
video.release()
cv2.destroyAllWindows()

And that's it! If you run the script, you'll see predictions from your model displayed on your screen, overlaid atop images from your webcam. You can download the code for this simple synchronous webcam inference example here.

Speeding Things Up

This implementation is pretty simple; it gets predictions from the model sequentially which means it waits to send the next image until it has received the results of the previous one.

The exact inference speed depends on your model and network connection, but you can expect about 4 frames per second on the Hosted API and 8 frames per second on an NVIDIA Jetson Xavier NX running on your local network.
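If you want to measure the throughput you're actually getting, you can wrap a simple timing helper around the inference function. This is a quick sketch, not part of the Roboflow snippets; measure_fps is a hypothetical helper that assumes an infer-style function like the one defined above:

```python
import time

def measure_fps(infer_fn, num_frames=30):
    # Call the inference function repeatedly and return frames per second
    start = time.time()
    for _ in range(num_frames):
        infer_fn()
    elapsed = time.time() - start
    return num_frames / elapsed
```

For example, `print(measure_fps(infer))` would report your effective end-to-end frame rate, including capture, encoding, and the network round trip.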

We can significantly increase our speed by parallelizing our requests. Keeping a buffer of images in memory adds a little bit of latency but improves the consistency with which we can display the frame. We have an example async webcam inference script demonstrating this approach here.
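The buffered-parallel idea can be sketched with a thread pool that keeps several requests in flight at once while still yielding results in order. This is a minimal illustration under those assumptions (parallel_infer is a hypothetical helper), not the actual async script linked above:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import deque

def parallel_infer(infer_fn, num_frames, buffer_size=4):
    # Keep up to buffer_size requests in flight; yield results in submission order
    futures = deque()
    with ThreadPoolExecutor(max_workers=buffer_size) as pool:
        for _ in range(num_frames):
            futures.append(pool.submit(infer_fn))
            # Once the buffer is full, block on the oldest request
            if len(futures) >= buffer_size:
                yield futures.popleft().result()
        # Drain any remaining in-flight requests
        while futures:
            yield futures.popleft().result()
```

Because results are popped in submission order, frames still display in sequence even when a later request finishes before an earlier one; the buffer adds latency equal to roughly buffer_size frames in exchange for smoother, higher throughput.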


The full code is available on our Github. Excited to see what you build!