We are pretty excited about the Luxonis OpenCV AI Kit (OAK-D) device at Roboflow, and we're not alone. Our excitement has naturally led us to create another tutorial on how to train and deploy a custom object detection model leveraging Roboflow and DepthAI, to the edge, with depth, faster.

To illustrate the immensely wide open scope of applications this guide opens up, we tackle the task of realtime identification of American sign language.

American Sign Language alphabet.
American Sign Language alphabet.

Tutorial Outline

  • Gather and Label Images
  • Install MobileNetV2 Training Environment in the TensorFlow OD API
  • Download Custom Training Data From Roboflow
  • Train Custom MobileNetV2 object detection model
  • Run Test Inference to Check the Model's Functionality
  • Convert Custom MobileNetV2 TensorFlow Model to OpenVino and DepthAI
  • Run Our Custom Model on the edge with depth on the Luxonis OAK-D
This is an end to end computer vision tutorial

Data

Training

Model Conversion

Deployment

End to End

YouTube Discussion

Gather and Label Images

To get started with this guide, you will need labeled object detection images. Object detection labeling involves drawing a box around the object you want to detect and defining a class label for that object. The labeled images will provide supervision for our model to learn from.

If you would like to follow along with the tutorial, you can skip this step and fork the public American Sign Language Training Data.

To label your own dataset, at the current moment, our labeling tool of choice is CVAT:

This guide will also show you how to upload images to Roboflow after labeling.

Install MobileNetV2 Training Environment

Once you have your dataset labeled, it is time to jump into training!

To the Colab Notebook to Deploy Custom Model to OAK-D - we recommend having this up alongside this blog post. You will be able to run the cells in the notebooks sequentially, only changing one line of code for your dataset import.

To start off in this notebook, make a copy in your drive and make sure the GPU Runtime is enabled. Then, we will install dependencies for the TF OD library in the first few cells.

Download Custom Training Data From Roboflow

Whether you choose to fork the public sign language detection dataset, or bring your own, you will want to use Roboflow to get your data in the proper format for training, namely, TFRecord.

Choose data preprocessing and augmentation settings. Hit Generate. Then Donwload and choose Tensorflow TFRecord.

Acquiring a download link from Roboflow

Then we'll paste that download link into the Colab notebook to bring our data in.

Downloading data from Roboflow into the Colab notebook for training

Train Custom MobileNetV2 Object Detection Model

After we have our data in the notebook, we can proceed with training our custom model.

We will first download the pretrained MobileNetV2 weights to start from. Then we'll configure a few parameters in our training configuration file. You can edit these configurations to change based on your datasets needs. Most commonly, you may want to adjust the number of steps your model trains for. More steps takes longer but typically yields a more performant model.

Once the training configuration is defined, we can go ahead and kick off training.

Training a Custom MobileNetV2 model ⏰

Run Test Inference to Check the Model's Functionality

Before moving on to model conversion, it is wise to check our model's performance on test images. These are images that the model has never seen before.

Our model is making good test inference on test images - time to take it live

Convert Custom MobileNetV2 TensorFlow Model to OpenVino and DepthAI

Once you are satisfied with your models performance metrics and test inference capabilities it is time to take it live!

Before your model will be usable on the OAK-D Device, you will first need to convert it to a format that is compatible with the DepthAI framework.

The first step is to install OpenVino (Intel's edgeAI framework) and convert your model into an OpenVino Intermediate Representation.

Converting to OpenVino IR

From the OpenVino IR, we then send the model up to DepthAI's API to convert it to a .blob.

Download the .blob and put it somewhere accessible to the machine running your OAK device (AWS S3, USB stick, etc.) . Note: you will need a linux based environment to host the OAK device.

Deploy Our Custom Model to the Luxonis OAK-D

To deploy our custom model on the Luxonis OAK-D, we'll first need to clone the DepthAI repository and install the necessary requirements. Here is the documentation for setting up the DepthAI repo.

Plugging in the OAK-D and setting up custom weights in the DepthAI repository

Then, you can click in your OAK device and test the following command to see if your installs worked:

python3 depthai_demo.py -dd -cnn mobilenet-ssd

This runs the base mobilenet-sdd model that has been trained on the COCO dataset. To run our model, we'll leverage all the infra around this base model with our own twist.

Within the DepthAI repo, you will see a folder called resources/nn/mobilenet-sdd. To bring in our own model, we'll copy that folder to resources/nn/[your_model] and rename the three files to [your_model] in place of mobilenet-ssd.

Then we copy over our custom models weights in place of what once was mobilenet-sdd.blob.  Then we edit the two JSON files, to replace the default class names with our own class list. If you need to double check the order of your custom class lists, you can check the printout in the Colab notebook when we imported our data. After rewriting this files, we are ready to launch our custom model!

python3 depthai_demo.py -dd -cnn [your-model]

Live, realtime inference on the OAK-D Device (inferencing with X,Y, and Z coordinates)

You will see the video stream forming bounding boxes around the objects you trained the model to detect along with a depth measurement of how far away the object is from the camera.

Conclusion

Congratulations! Through this tutorial, you have learned how to span the long distance from collecting and labeling images to running realtime inference on device with a custom trained model.

We showed how to tackle the problem of identifying American Sign Language in this blog as a demonstration of the wide scope of custom tasks you can apply in your own domain with your dataset.

Good luck and as always, happy training.