How to Use OCR on Videos
Published Apr 1, 2024 • 3 min read

Optical Character Recognition (OCR) can be a valuable addition to any system, especially when integrated with computer vision. OCR and computer vision work on still images, but they can be even more powerful when applied to videos and video streams.

An annotated video of the final result of running OCR on video

In this article, you will learn how to use OCR models and combine them with computer vision tools to build applications that process videos.

How to Use OCR on Videos

For static video files, you can run OCR on every frame of the video. In this guide, we will use this approach to identify the IDs of intermodal shipping containers in a shipping yard.

In this example, we have a video of several shipping containers, each of which displays one or more identifying numbers. Using this object detection model from Roboflow Universe, we can locate each ID on a container and then read its digits with OCR.

First, let’s set up a helper function that we can use later to run OCR on detections. For this example, we will use EasyOCR. Learn about other OCR options and how they perform in this blog post.

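import re

import easyocr
import supervision as sv

# Load the English recognition model once and reuse it for every frame
reader = easyocr.Reader(['en'])

def run_ocr(frame, box):
  # Crop the frame to the detection's bounding box (x1, y1, x2, y2)
  cropped_frame = sv.crop_image(frame, box)

  # detail=0 returns only the recognized strings
  result = reader.readtext(cropped_frame, detail=0)

  # Join the fragments and keep only the digits
  text = "".join(result)
  text = re.sub(r'[^0-9]', '', text)

  return text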

For our use case, we only want the numbers that make up the ID, so we strip every non-digit character from the detected text.
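To make the digit filter concrete, here is a minimal sketch of that step on its own. The function name `clean_id` and the sample fragments are made up for illustration; only the join and the regular expression mirror the code above.

```python
import re

def clean_id(fragments):
    """Join OCR text fragments and keep only the digits."""
    text = "".join(fragments)
    # [^0-9] matches every character that is not a digit
    return re.sub(r"[^0-9]", "", text)

# Hypothetical OCR output for one cropped ID region
print(clean_id(["MSKU 123", "456-7"]))  # → 1234567
```

Note that this filter also discards any letters in the ID, which is fine here because we only want the numeric part.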

Next, we use Supervision’s `process_video` function to run a two-step process on each frame: detect the IDs, then run OCR on each detection. We then annotate each frame with the bounding boxes and the recognized text to produce an annotated video.

import supervision as sv

def video_callback(frame, i):
  # predict() runs the object detection model and returns sv.Detections (not shown here)
  detections = predict(frame)
  # Keep only the ID-number class (class_id 2)
  detections = detections[detections.class_id == 2]

  # Run OCR on each detected ID region; detection[0] is its xyxy box
  labels = []
  for detection in detections:
    detected_text = run_ocr(frame, detection[0])
    labels.append(detected_text)

  annotated_frame = frame.copy()
  annotated_frame = sv.BoundingBoxAnnotator().annotate(annotated_frame, detections)
  annotated_frame = sv.LabelAnnotator().annotate(annotated_frame, detections, labels)

  return annotated_frame

sv.process_video(VIDEO_PATH, "cargo_rawocr.mp4", video_callback)

With that, we get an annotated video in which each detection is labeled with its recognized text.

Apply Tracking to Identify Unique Items in Videos

Although it is possible to run OCR on every frame, doing so can be needlessly inefficient and costly, and it is not especially useful in production use cases.

Building on the previous example, we will keep the object detection model and OCR, then add object tracking with custom code to “link” each identified container to its corresponding ID. This lets us run OCR a few times per container rather than hundreds of times.

Because each ID appears in multiple frames of the video, we can also build consensus logic over several OCR readings to improve accuracy.

To improve readability, some code that is used to produce the final output isn’t in this post. The full code is available in this Colab notebook.

Using the ByteTrack implementation in Supervision, we can track each object across the video frames in which it appears. We also created a helper class that keeps track of OCR recognitions and makes sure that the final text is the one recognized most often across ten OCR attempts.
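The helper class itself is not reproduced in this post. As a rough illustration, here is a minimal sketch of what such a class could look like. It reuses the names `Consensus`, `add_candidate`, and `winner` from the code in this section, but the internals are an assumption, not the notebook's implementation. In particular, the `required_attempts` parameter and the choice to ignore empty readings are ours.

```python
from collections import Counter, defaultdict

class Consensus:
    """Majority vote over repeated OCR readings, keyed by tracker ID.

    Illustrative sketch only; the notebook's actual class may differ.
    """

    def __init__(self, required_attempts=10):
        self.required_attempts = required_attempts
        self.candidates = defaultdict(list)

    def add_candidate(self, key, text):
        # Assumption: an empty read does not count as an attempt
        if text:
            self.candidates[key].append(text)

    def winner(self, key):
        # Return None until enough readings have been collected
        readings = self.candidates.get(key, [])
        if len(readings) < self.required_attempts:
            return None
        # The most frequent reading wins
        return Counter(readings).most_common(1)[0][0]
```

Returning `None` until the threshold is reached means a caller can treat a truthy winner as "this container is resolved" and skip further OCR for it.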

import supervision as sv
import cv2
import numpy as np

# Initialize ByteTrack
tracker = sv.ByteTrack()

# Consensus monitoring utility
container_ids_tracker = Consensus()

# Keeps track of the IDs of detected containers
container_ids = {}

# A callback function runs for each video frame
def video_callback(frame, i):
  detections = predict(frame)
  detections = tracker.update_with_detections(detections)

  relevant_detections = detections[(detections.class_id == 1) | (detections.class_id == 2)]
  container_detections = detections[detections.class_id==1]
  id_detections = detections[detections.class_id==2]

  for i_idx, id_detection in enumerate(id_detections):
      id_box = id_detection[0]
      for c_idx, container_detection in enumerate(container_detections):
          # If an ID is within a container, run OCR on it.
          if check_within(id_box, container_detection[0]):
              # Index 4 of a detection is its tracker ID
              parent_container_id = container_detection[4]

              # Skip OCR once this container already has a consensus ID
              container_id_winner = container_ids_tracker.winner(parent_container_id)
              if container_id_winner:
                  continue

              ocr_result = ocr(frame, id_box, id_detection[4])
              container_ids_tracker.add_candidate(parent_container_id, ocr_result)

  # Video annotation label code...  

  annotated_frame = frame.copy()
  annotated_frame = sv.BoundingBoxAnnotator().annotate(annotated_frame,relevant_detections)
  # More video annotation code...

  return annotated_frame

sv.process_video(VIDEO_PATH,"cargo_processed.mp4",video_callback)
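The callback above relies on a `check_within` helper that is not shown in this post. One plausible way to write it, assuming both boxes are in `(x1, y1, x2, y2)` pixel format as returned by Supervision, is a simple containment test. The `tolerance` parameter is our own addition to allow for slightly loose bounding boxes; it is not part of the original code.

```python
def check_within(inner_box, outer_box, tolerance=0.0):
    """Return True if inner_box lies inside outer_box.

    Both boxes are (x1, y1, x2, y2). Illustrative sketch only;
    `tolerance` (in pixels) is a hypothetical extra parameter.
    """
    ix1, iy1, ix2, iy2 = inner_box
    ox1, oy1, ox2, oy2 = outer_box
    return bool(
        ix1 >= ox1 - tolerance
        and iy1 >= oy1 - tolerance
        and ix2 <= ox2 + tolerance
        and iy2 <= oy2 + tolerance
    )
```

For example, `check_within((120, 40, 180, 60), (100, 20, 400, 300))` is true, because the ID box sits entirely inside the container box.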

After running the code, we get a final annotated video, as well as text data on which containers were present, which could be helpful for yard management use cases.

An annotated video of the final result of running OCR on video

Conclusion

In this guide, we covered how to use OCR in combination with computer vision on video, and showed a small portion of what OCR, computer vision, and video can do when combined with additional tools like tracking.

If you would like more information on how Roboflow can help your business in setting up a comprehensive computer vision system, contact our sales team.

Cite this Post

Use the following entry to cite this post in your research:

Leo Ueno. (Apr 1, 2024). How to Use OCR on Videos. Roboflow Blog: https://blog.roboflow.com/ocr-on-videos/

Written by

Leo Ueno
ML Growth Associate @ Roboflow | Sharing the magic of computer vision | leoueno.com