Video Object Counting: A Step-by-Step Guide
Using tools like Roboflow Supervision, OpenCV, and YOLO, you can track and count unique objects in videos. This is useful for a wide range of use cases, from calculating analytics about a game of football to tracking how many products are present on an assembly line at a given point in time.
In this guide, we are going to walk through how to count objects in videos.
Here is a demo showing what we will build:
Technologies Used
To build a robust and efficient object-counting system, we will use the following tools:
- OpenCV
- Roboflow Inference
- Supervision
Each of these plays a crucial role in object counting. Let’s briefly review each one.
OpenCV
OpenCV (Open Source Computer Vision Library) is a foundational tool for computer vision tasks. In this setup, OpenCV will be the interface to process the video feed, enabling us to extract and analyze individual frames for object detection.
Roboflow Inference
Roboflow Inference allows you to run any of the 50,000+ pre-trained public models available on Roboflow Universe, as well as custom fine-tuned models created on or uploaded to the Roboflow platform. Inference assigns IDs to commonly used models for convenience, and these aliased models do not require an API key, unlike other public or private models. For our use case, we will use YOLOv8, which can identify and classify objects within each frame of the video feed. After detecting the objects, we can count them once they enter the target region with the Supervision library.
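For example, a COCO-pretrained YOLOv8 model can be loaded by its alias and run on a single image (a minimal sketch; the image path is a placeholder):

import cv2
from inference import get_model

model = get_model(model_id="yolov8x-640")  # aliased COCO-pretrained model, no API key required
image = cv2.imread("example.jpg")          # placeholder image path
results = model.infer(image)[0]            # first response for a single image
print(results)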
Supervision
Supervision is a library of tools created by Roboflow that offers a streamlined way to annotate predictions from various object detection and segmentation models. It supports inference with the Inference, Ultralytics, or Transformers packages.
This will allow us to use different models and utilities to analyze the YOLOv8 detections and achieve tracking and counting.
We’ll use Supervision with the following video tracking algorithms and techniques (a minimal sketch of how they fit together follows the list):
- ByteTrack: Tracking objects across frames is crucial in object counting, especially in dynamic environments where objects overlap or move rapidly. ByteTrack is an advanced multi-object tracking (MOT) algorithm designed for such challenges. Rather than keeping only high-confidence detection boxes, BYTE treats every detection box as a fundamental unit, much like a byte in a program, and uses low-confidence boxes to recover objects that are occluded or blurred.
- Polygon Zone: For applications where counting is restricted to certain areas or zones, the Supervision Polygon Zone provides an additional layer of precision. This tool allows us to define specific regions within the video frame (such as a doorway, shelf, or section of a road) where object counting will be performed.
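The minimal sketch below (with placeholder polygon coordinates and the detection step shown as comments) illustrates how these two pieces fit together; the full, working version appears in the implementation steps:

import numpy as np
import supervision as sv

# Placeholder polygon; in this guide the coordinates come from points drawn on the first frame
zone = sv.PolygonZone(polygon=np.array([[100, 100], [500, 100], [500, 400], [100, 400]]))
tracker = sv.ByteTrack()

# For each video frame: detect -> track -> check the zone
# detections = sv.Detections.from_inference(model.infer(frame)[0])
# detections = tracker.update_with_detections(detections)
# in_zone = zone.trigger(detections=detections)  # boolean mask of detections inside the zone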
System Overview
The primary goal is to count objects accurately by employing preprocessing, detection, and tracking techniques. The following sections present the architecture diagram and define the workflow.
Workflow
- Video Feed: Capture video input as the primary data source for the counting system.
- Preprocessing: Use OpenCV to perform basic preprocessing to produce single frames for detection.
- Object Detection: Employ YOLOv8 to identify and classify objects within each frame.
- Object Tracking: Implement ByteTrack to maintain consistent tracking of detected objects across frames, ensuring they’re accurately followed throughout the video.
- Zone Check: If a detected and tracked object enters a defined polygonal target zone, it triggers a counting event.
- Counter Update: Increment the counter by one (+1) each time an object passes through the target zone.
With the architecture and workflow defined, the next phase is implementing this system.
Implementation Steps
This guide walks through setting up a system that lets you draw a polygon on a video frame and count objects within the defined target zone. Let’s start with the setup:
Step 1: Prerequisites
First, make sure you have the necessary libraries installed. We’ll use OpenCV for video processing, YOLOv8 for object detection, and Supervision for handling object tracking and counting. You can install them using the following commands:
pip install opencv-python-headless supervision inference
Additionally, import these libraries in your Python script:
import cv2
import supervision as sv
import numpy as np
from inference import get_model
from supervision.assets import download_assets, VideoAssets
This setup is required to ensure you can load and process video frames, perform object detection, and track objects across frames.
Step 2: Video Loading and Preprocessing
We’ll use supervision's download_assets function to fetch a sample video. This eliminates the need for a local video file.
# Download the sample grocery store video
download_assets(VideoAssets.GROCERY_STORE)
SOURCE_VIDEO_PATH = "<Your_path>"
This downloads the video to your working directory; set SOURCE_VIDEO_PATH to the path of the downloaded file.
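Alternatively, recent versions of supervision return the downloaded file's name from download_assets (verify this against your installed version), so you can assign the path directly:

# download_assets returns the name of the downloaded file
SOURCE_VIDEO_PATH = download_assets(VideoAssets.GROCERY_STORE)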
Next, we’ll use OpenCV to open the video and grab its first frame, on which we’ll draw a polygon representing the target zone where objects will be counted.
# Open the video file
cap = cv2.VideoCapture(SOURCE_VIDEO_PATH)
ret, frame = cap.read()

if not ret:
    print("Failed to read the video")
    cap.release()
    cv2.destroyAllWindows()
    exit()

# Make a copy of the frame for polygon drawing
image = frame.copy()
cap.release()
We use cap.read() to capture the first frame. If reading fails, the code exits after releasing the resources.
Next, we'll allow users to draw a polygon by clicking points on the frame to define our target zone. This polygon will act as a boundary for tracking objects.
We will then define a callback function that captures mouse click events, storing each point clicked.
As points are added, we can use OpenCV to draw lines between them to create a polygon.
polygon_points = []  # Stores points for the polygon

# Define the mouse callback function for drawing the polygon
def draw_polygon(event, x, y, flags, param):
    global polygon_points, image
    if event == cv2.EVENT_LBUTTONDOWN:  # If left mouse button is clicked
        polygon_points.append((x, y))
        if len(polygon_points) > 1:
            cv2.polylines(image, [np.array(polygon_points, dtype=np.int32)], False, (0, 255, 0), 2)
        cv2.imshow('image', image)
Finally, we can display the image and capture the clicked points. We will open a window and set the callback function. Users can press 'q' to finish drawing the polygon.
# Set up window and callback
cv2.namedWindow('image', cv2.WINDOW_NORMAL)
cv2.setMouseCallback('image', draw_polygon)

print("Draw a polygon by clicking points. Press 'q' to finish.")

# Show the image and enable drawing
while True:
    cv2.imshow('image', image)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cv2.destroyAllWindows()
print("Polygon points:", polygon_points)
The polygon points will be stored in polygon_points, defining the boundary for object counting. We can click on points, which will be connected to form a polygon that defines our target zone.
Step 3: Object Detection with YOLOv8 and Inference
Now that our target zone is defined, let’s detect objects within the video frames using YOLOv8.
First, we need to load a YOLO model. For this example, we will load a YOLOv8x model trained on the Microsoft COCO dataset. You can use any YOLOv8 model you want with this guide.
model = get_model(model_id="yolov8x-640")
Supervision offers a convenient utility that wraps the video in a generator, which we can iterate over to get individual frames:
generator = sv.get_video_frames_generator(SOURCE_VIDEO_PATH)
frame = next(generator)
Next, we can run the model to get bounding boxes around detected objects.
results = model.infer(frame)[0]
detections = sv.Detections.from_inference(results)
The results variable contains the raw YOLOv8 predictions, including bounding boxes, confidence scores, and class IDs. The call sv.Detections.from_inference(results) converts these predictions into a Detections object compatible with the Supervision library, enabling further processing for tracking, annotation, and zone triggering.
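A Detections object exposes the boxes, scores, and class information as NumPy arrays, so you can inspect the raw values directly (a small sketch, assuming the detections variable from above):

# Each field is a NumPy array aligned by detection index
print(detections.xyxy)                # bounding boxes as (x_min, y_min, x_max, y_max)
print(detections.confidence)          # confidence score per detection
print(detections.class_id)            # numeric class ID per detection
print(detections.data["class_name"])  # class names populated by from_inference
print(len(detections))                # number of detections in the frame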
Step 4: Object Tracking
Tracking helps maintain object identities across frames, which is crucial for counting objects entering or exiting the defined zone. As discussed above, we’ll use ByteTrack for tracking.
Initialize the ByteTrack tracker using the code below:
# Initialize ByteTrack for tracking
byte_tracker = sv.ByteTrack()
Each detected object is assigned a unique ID, allowing us to follow it across multiple frames. Now, we can update our detections with tracking information:
detections = byte_tracker.update_with_detections(detections)
Our detections now contain the detected objects along with their tracker IDs.
Step 5: Object Counting
The main objective is to count objects entering or exiting the defined target zone. To achieve this, we’ll take the polygon the user drew and pass it to a PolygonZone.
zone = sv.PolygonZone(polygon=np.array(polygon_points))
zone_annotator = sv.PolygonZoneAnnotator(zone=zone, color=sv.Color.WHITE, thickness=6, text_thickness=6, text_scale=4)
The PolygonZone defines a specific area using the points we draw, creating a polygon that serves as the target zone. The PolygonZoneAnnotator then visualizes this zone on the video frames and updates the count of objects that enter or exit the defined area, helping track movements within the polygon.
On each frame, we’ll update the count by checking if objects are within the polygon.
zone.trigger(detections=detections)
Each call to zone.trigger() checks the current detections against the polygon: it returns a boolean mask indicating which detections fall inside the zone and updates the zone's current_count accordingly.
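Assuming the detections variable from the previous step, the returned mask can also be used to isolate the detections that are currently inside the zone:

# trigger() returns a boolean mask aligned with the detections
in_zone_mask = zone.trigger(detections=detections)
detections_in_zone = detections[in_zone_mask]
print(f"Objects currently in zone: {zone.current_count}")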
Complete Code
In this section, we’ll combine the detection, tracking, and counting logic into a single function.
def Person_counting(video_path, output_path, target_zone):
    generator = sv.get_video_frames_generator(video_path)
    frame = next(generator)

    # Set up the output video writer with the same resolution as the input
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')  # 'mp4v' for .mp4
    out = cv2.VideoWriter(output_path, fourcc, 20, (frame.shape[1], frame.shape[0]))

    # Tracker, annotators, zone, and model are created once, outside the frame loop
    byte_tracker = sv.ByteTrack()
    trace_annotator = sv.TraceAnnotator(thickness=4, trace_length=50)
    bounding_box_annotator = sv.BoundingBoxAnnotator(thickness=2)
    label_annotator = sv.LabelAnnotator(text_thickness=2, text_scale=1, text_color=sv.Color.BLACK)
    zone = sv.PolygonZone(polygon=target_zone)
    zone_annotator = sv.PolygonZoneAnnotator(zone=zone, color=sv.Color.WHITE, thickness=6, text_thickness=6, text_scale=4)
    model = get_model(model_id="yolov8x-640")

    for frame in generator:
        # Model prediction on a single frame, converted to Supervision Detections
        results = model.infer(frame)[0]
        detections = sv.Detections.from_inference(results)

        # Track detections across frames, then update the zone count
        detections = byte_tracker.update_with_detections(detections)
        zone.trigger(detections=detections)

        # Build labels from the class names, confidences, and tracker IDs
        labels = [
            f"#{tracker_id} {class_name} {confidence:0.2f}"
            for class_name, confidence, tracker_id
            in zip(detections.data["class_name"], detections.confidence, detections.tracker_id)
        ]

        annotated_frame = trace_annotator.annotate(
            scene=frame.copy(),
            detections=detections
        )
        annotated_frame = bounding_box_annotator.annotate(annotated_frame, detections)
        annotated_frame = label_annotator.annotate(annotated_frame, detections, labels=labels)
        annotated_frame = fillPolyTrans(annotated_frame, [target_zone], color=(0, 0, 255), opacity=0.5)
        annotated_frame = cv2.putText(annotated_frame, f'Person entered: {zone_annotator.zone.current_count}', (10, 70), cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 0), 2, cv2.LINE_AA)

        out.write(annotated_frame)  # Write the frame into the file at 'output_path'

    out.release()
    cv2.destroyAllWindows()
Here’s a summary of the code above:
Video Frame Generation:
- sv.get_video_frames_generator(video_path): Loads the video and generates frames for processing.
- cv2.VideoWriter: Defines the output video file and sets video codec, frame rate, and resolution.
Object Detection (YOLOv8):
- get_model(model_id="yolov8x-640"): Loads the YOLOv8x model for object detection.
Object Tracking:
- sv.ByteTrack(): Tracks detected objects across frames, assigning unique IDs to each object.
Target Zone Definition:
- sv.PolygonZone(polygon=target_zone): Defines a target area (polygon) where objects are counted as they enter.
- sv.PolygonZoneAnnotator: Annotates the target zone on the video frame and keeps count of objects entering it.
Annotations and Display:
- sv.BoundingBoxAnnotator and sv.LabelAnnotator: Add bounding boxes and labels to the objects being tracked.
- Displays the count of objects (people) entering the zone with cv2.putText.
Output Video:
- Annotated frames with detections, labels, and zone counts are written into an output video file using cv2.VideoWriter.
Final Processing:
- Once all frames are processed, the output video is saved, and resources are released with out.release() and cv2.destroyAllWindows().
There is a utility function that adds a translucent overlay to the target zone for easier visibility. Here’s the code for it:
def fillPolyTrans(img, points, color, opacity):
    points_array = np.array(points, dtype=np.int32)
    overlay = img.copy()  # copy of the image to draw the filled polygon on
    cv2.fillPoly(overlay, [points_array], color)
    # Blend the filled overlay with the original image to create the transparency effect
    new_img = cv2.addWeighted(overlay, opacity, img, 1 - opacity, 0)
    return new_img
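To run the full pipeline, pass the source video, an output path, and the polygon drawn earlier (the output filename below is just an example):

# Convert the clicked points from Step 2 into the NumPy array PolygonZone expects
target_zone = np.array(polygon_points, dtype=np.int32)

Person_counting(SOURCE_VIDEO_PATH, "output.mp4", target_zone)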
Here are the results on an example video:
Conclusion
This guide demonstrated how to set up object detection, tracking, and counting using YOLOv8 and the Roboflow Supervision library. By combining YOLOv8 for detection with ByteTrack for tracking, we processed video frames, followed people across frames, and counted the people entering a defined zone in real time.
With this technological foundation, many use cases can be implemented. Are you a developer who wants to learn more? Check out these resources to dive deeper into object detection and tracking:
- An article on Launch: Use YOLO11 with Roboflow
- A guide on how to track objects using ByteTrack
- Another similar use case is Automatic Stop Sign Violation Detection