We are excited to announce that you can now track objects frame over frame in videos and camera streams using the Roboflow Inference API and our open source zero shot object tracking repository, without having to train a separate classifier for your object track features.

Tracking individual fish in a video.

What is object tracking?

Object detection models are good at identifying objects frame over frame, but they provide no concept of object permanence between frames. If you want to count objects in a video or camera stream, you need to add an additional layer of intelligence: object tracking.
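To make the "layer of intelligence" concrete, here is a minimal sketch of what a tracker adds on top of a detector: it assigns a persistent ID to each box by matching detections in a new frame to existing tracks. This toy version matches by box overlap (IoU) only; Deep SORT and the CLIP-based tracker described below add appearance features on top. All names here are illustrative, not the repository's actual code.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def update_tracks(tracks, detections, next_id, iou_threshold=0.3):
    """Greedily match new detections to existing tracks; unmatched ones start new tracks."""
    updated = {}
    for det in detections:
        best_id, best_iou = None, iou_threshold
        for track_id, box in tracks.items():
            score = iou(box, det)
            if score > best_iou and track_id not in updated:
                best_id, best_iou = track_id, score
        if best_id is None:              # no existing track matched: new object
            best_id, next_id = next_id, next_id + 1
        updated[best_id] = det
    return updated, next_id

frame1 = [(0, 0, 10, 10)]
frame2 = [(1, 1, 11, 11)]                # same object, shifted slightly
tracks, next_id = update_tracks({}, frame1, next_id=0)
tracks, next_id = update_tracks(tracks, frame2, next_id)
print(tracks)                            # the shifted box keeps ID 0
```

Counting objects then reduces to counting distinct track IDs rather than per-frame detections.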

How Object Tracking Used to Work

Traditional object tracking approaches like Deep SORT compare the similarity of objects to each other across frames. The similarity metric is computed by a separate featurizer network - usually a classification model fine-tuned on object tracks. Each detected object is passed through the featurizer, and objects are compared by the "semantic" distance between their features.
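The appearance-matching step can be illustrated in a few lines: crops are embedded by the featurizer, then compared by cosine distance. The vectors below are made-up stand-ins for real featurizer outputs.

```python
import numpy as np

def cosine_distance(a, b):
    """Semantic distance between two feature vectors (0 = same direction)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

track_feature = np.array([0.90, 0.10, 0.00])  # feature of the object being tracked
same_object   = np.array([0.85, 0.15, 0.05])  # new crop of the same object
other_object  = np.array([0.05, 0.10, 0.90])  # crop of a different object

# The matching detection is semantically much closer to the track.
print(cosine_distance(track_feature, same_object) <
      cosine_distance(track_feature, other_object))  # True
```

The hard part is not this comparison but producing features where "same object" reliably means "small distance" - which is exactly what the fine-tuned featurizer is for.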

The big downside of this approach is that you have to fine-tune this featurizer model for your dataset, which means gathering additional annotations of object tracks and running another training routine in addition to your object detection model. 👎

Object Tracking with Zero Shot CLIP

OpenAI's CLIP model was trained as a zero shot image classifier, and has been shown to provide robust image features across domains. Check out this blog where we test CLIP on flower classification.

The breakthrough in our zero shot object tracking repository is to use generalized CLIP object features, eliminating the need to make additional object track annotations or train another model on your own. 👍

We have found empirically that CLIP features work quite well for tracking, since objects within a track look very similar to one another. You can also improve tracking performance by increasing the processing frame rate, which reduces how much objects move and change between processed frames.
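Because CLIP image embeddings are unit-length, matching a new detection to the most similar existing track reduces to a dot product. A hedged sketch of that assignment step - the vectors here are made-up stand-ins for real CLIP features, and the names are illustrative, not the repository's actual code:

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length, as CLIP does for its embeddings."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def match_to_track(det_feat, track_feats, min_sim=0.7):
    """Return the ID of the most similar track, or None if nothing is close enough."""
    best_id, best_sim = None, min_sim
    for track_id, feat in track_feats.items():
        sim = float(np.dot(det_feat, feat))  # cosine similarity for unit vectors
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id

tracks = {0: unit([1.0, 0.1]), 1: unit([0.1, 1.0])}
print(match_to_track(unit([0.9, 0.2]), tracks))  # → 0
```

The matching machinery is unchanged from the Deep SORT approach; only the source of the features changes, from a fine-tuned featurizer to off-the-shelf CLIP.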

Using The Roboflow Object Tracking Repository

The first step is to train a custom object detection model in Roboflow with Roboflow Annotate and Roboflow Train.

Once your model has been trained, it will be hosted behind an inference API, and you will receive your model endpoint and API key.
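To show how the endpoint and API key fit together: the tracker sends each frame to your hosted model endpoint, authenticated with `api_key` as a query parameter. This sketch only composes the URL (no request is sent); the endpoint and key are placeholders, and treat the extra parameter names as assumptions rather than a definitive API reference.

```python
from urllib.parse import urlencode

def inference_url(endpoint, api_key, **params):
    """Compose the full request URL for a hosted Roboflow model endpoint."""
    query = urlencode({"api_key": api_key, **params})
    return f"{endpoint}?{query}"

url = inference_url("https://detect.roboflow.com/playing-cards-ow27d/1",
                    "ROBOFLOW_API_KEY", confidence=40)
print(url)
```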

We highly recommend running this on GPU. If you don't have a GPU, check out Google Colab.

Then you can clone the zero-shot-object-tracking repo and install some dependencies.

git clone https://github.com/roboflow-ai/zero-shot-object-tracking
cd zero-shot-object-tracking
git clone https://github.com/openai/CLIP.git CLIP-repo
cp -r ./CLIP-repo/clip ./clip

pip install --upgrade pip
pip install -r requirements.txt

Then point the script to the video file you want to test on and provide your inference API endpoint and API key:

python clip_object_tracker.py --source data/video/cards.mp4 --url https://detect.roboflow.com/playing-cards-ow27d/1 --api_key ROBOFLOW_API_KEY

An example video will be generated containing the object tracks from your custom model.

Object Tracking on the Edge

If you need to do this kind of inference on the edge in real time, we recommend de-risking your approach on example videos and dropping us a line when you're ready to deploy.


If you have an object detection model, you can now use it with our zero shot object tracking repository to do object tracking - no additional modeling required.

A huge shoutout to Roboflow intern Maxwell Stone for bringing this idea to reality.

We would also like to thank OpenAI for open sourcing CLIP and the Deep SORT authors for their great work.

Happy training and more importantly, happy tracking!