How to Build a Custom Open Images Dataset for Object Detection

We are excited to announce integration with the Open Images Dataset and the release of two new public datasets encapsulating subdomains of the Open Images Dataset: Vehicles Object Detection and Shellfish Object Detection.

In this post, we will walk through how to make your own custom Open Images dataset.

Vehicles and Shellfish are just a small window into the vast landscape of the Open Images dataset and are meant as examples of the kinds of datasets you could construct with Open Images.

The vast array of subdomains in the Open Images Dataset. How do we make this tractable?

About Open Images

Open Images is an open source computer vision object detection dataset released by Google under a CC BY 4.0 License. The dataset contains a vast amount of data spanning image classification, object detection, and visual relationship detection across millions of images and bounding box annotations. The Open Images dataset provides a broad, large-scale ground truth for computer vision research.

Why Create A Custom Open Images Dataset?

The uses for creating a custom Open Images dataset are many:

Remember, this is all free, labeled computer vision data that lives in the Creative Commons.

The Open Images Query Tool

The whole Open Images Dataset is more than halfway to a terabyte, and to download it raw you would run commands such as:

aws s3 --no-sign-request sync s3://open-images-dataset/train [target_dir/train] (513GB)
aws s3 --no-sign-request sync s3://open-images-dataset/validation [target_dir/validation] (12GB)
aws s3 --no-sign-request sync s3://open-images-dataset/test [target_dir/test] (36GB)
The massiveness of Open Images (source)

Luckily, the open source community has created tools that make the Open Images database easy to query. To construct our custom Open Images datasets, we used the OIDv4_ToolKit. The OIDv4_ToolKit lets you query subdomains of the OID and limit the download to specific classes. With a single Python command, you can specify the class and the number of images you want, and the images come down with their bounding box annotations.

python3 main.py downloader -y --classes Lobster --Dataset Lobster --type_csv train --image_IsGroupOf 0 --n_threads 4 --limit 200
Downloading 200 labeled lobsters from Open Images
Downloading annotated crabs from Open Images
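If you want several classes, one option is to generate the same downloader command once per class. The sketch below only builds and prints the commands, using exactly the flags shown above. The helper name `build_download_command` is hypothetical, and the class names are assumed to match Open Images labels.

```python
import shlex


def build_download_command(class_name, limit=200, split="train", threads=4):
    """Return the OIDv4_ToolKit downloader command for one class as an argument list.

    Hypothetical helper: it only mirrors the flags used in the command above.
    """
    return [
        "python3", "main.py", "downloader", "-y",
        "--classes", class_name,
        "--Dataset", class_name,      # one output folder per class
        "--type_csv", split,
        "--image_IsGroupOf", "0",
        "--n_threads", str(threads),
        "--limit", str(limit),
    ]


# Assumption: "Lobster" and "Crab" are valid Open Images class names.
commands = [build_download_command(name) for name in ("Lobster", "Crab")]

for command in commands:
    # Print instead of executing, so the sketch runs without the toolkit installed.
    print(shlex.join(command))
```

To actually run a download, you could pass each argument list to `subprocess.run(command, check=True)` from inside your OIDv4_ToolKit checkout.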

Converting Open Images Annotation Formats

We are excited to announce that we now support Open Images data formats at Roboflow. When you download the Open Images data, you will receive a large, unwieldy CSV file containing all of the annotations in the entire dataset, along with a class map. You will also receive .txt annotation files, one per image, that are much more tractable. We support both of these formats, but I recommend using the .txt files.
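If you want to inspect the per-image .txt files yourself before uploading, a small parser is enough. The sketch below assumes each line has the form `class_name left top right bottom` with pixel coordinates, which is how I understand the OIDv4_ToolKit output. Check one of your own files before relying on it. The `Box` type, the `parse_label_text` helper, and the sample values are made up for illustration.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    left: float
    top: float
    right: float
    bottom: float


def parse_label_text(text: str) -> List[Box]:
    """Parse label lines assumed to look like: class_name left top right bottom."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        # The last four tokens are coordinates; everything before them is the
        # class name, which may contain spaces (for example "Traffic light").
        *name_parts, left, top, right, bottom = parts
        boxes.append(
            Box(" ".join(name_parts), float(left), float(top), float(right), float(bottom))
        )
    return boxes


# Made-up sample content standing in for one downloaded label file.
sample = "Lobster 12.5 40.0 310.0 275.5\nLobster 400.0 120.0 640.0 380.0\n"
boxes = parse_label_text(sample)

for box in boxes:
    # Print the label with the box width and height in pixels.
    print(box.label, box.right - box.left, box.bottom - box.top)
```

For a real file, read its contents with `pathlib.Path(label_path).read_text()` and pass the result to `parse_label_text`.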

In order to convert your annotations into any format, you simply make a free account with Roboflow and drag your images into the data upload flow.

Upload Open Images data for conversion to Roboflow

Once your dataset is created, you will be able to export in any format you desire. To name a few you will be able to:

Then you can train your custom detector with whichever model you like! At the time of writing, I am mostly training YOLOv5 detectors.

You can also merge your new custom dataset with another one of your datasets to increase coverage.

Introducing Roboflow's Public Custom Open Images Datasets

We have created two public custom Open Images datasets and shared them among our public datasets: Vehicles Object Detection and Shellfish Object Detection.

Shellfish Object Detection class distribution
Shellfish Object Detection example images
Vehicles Object Detection class distribution
Vehicles Object Detection example images

They have been shared for public use among our public computer vision datasets.

Conclusion

Now you know how to construct a custom Open Images dataset using completely free computer vision data and open source tools.

We look forward to seeing what you build with Open Images! 🚀

If you are interested in scaling up these datasets or working on creating your own, please drop us a line!