Launch: Dataset Search

Roboflow Annotate has been used to manage and label 90,000 datasets containing 66 million images. Starting today, you can use text-based search queries to better understand your datasets in Roboflow.

Dataset Search enables text-based search queries for datasets

Text-based search lets you understand a dataset in ways that go beyond class balance, null or missing annotations, and object count histograms. Dataset Search uses computer vision to help you filter and find images using everyday language instead of relying on your annotation data or pre-trained assisted labeling models.
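The post does not spell out how a text query is matched to images. One common approach, assumed here purely for illustration, is to embed the query and every image into a shared vector space (the like-image filter in this post is described as using CLIP) and rank images by cosine similarity. The minimal sketch below uses made-up three-dimensional vectors in place of real embeddings; the file names and the `rank_images` helper are hypothetical and not part of any Roboflow API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_images(query_embedding, image_embeddings):
    """Return (name, score) pairs, most similar to the query first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real image embeddings (assumption).
image_embeddings = {
    "street_01.jpg": [0.9, 0.1, 0.0],
    "street_02.jpg": [0.1, 0.9, 0.1],
    "street_03.jpg": [0.7, 0.3, 0.1],
}
# Stand-in for the embedding of a text query such as "stop sign".
query_embedding = [1.0, 0.0, 0.0]

ranking = rank_images(query_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real system the vectors would come from an embedding model and have hundreds of dimensions, but the ranking step would look much the same.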

Dataset Search is an important tool for a data-centric approach to deploying high-performing computer vision models into production. Employing active learning with a few lines of Python to sample images and train the next version of your model is a common data-centric approach to improving your computer vision model.

Dataset Search now also has a few image metadata filters you can use to refine your search when looking for images in your dataset.

Here are the filters available:

like-image:<SOURCE_ID>: Sorts images by semantic similarity to the source image, as measured by CLIP.
tag: Filters by user-provided tags.
filename: Searches for file names that fully match the provided file name. Use * at the beginning and end of a query to run a partial match.
split: Filters by split (train, test, valid).
job:<JOB_ID>: Shows images with the provided job ID.
min-width:X: Shows images with a width greater than X.
max-width:X: Shows images with a width less than X.
min-height:X: Shows images with a height greater than X.
max-height:X: Shows images with a height less than X.
min_annotations:X: Shows images with more than the specified number of annotations.
max_annotations:X: Shows images with fewer than the specified number of annotations.
classname:CLASS: Shows images that have at least one annotation with the provided label.
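To make the filter semantics concrete, here is a minimal local sketch that mirrors a few of them on plain Python records. It reads every min- filter as a lower bound and every max- filter as an upper bound, and treats the bounds as inclusive, which is an assumption because the post does not say how boundary values are handled. The `ImageRecord` class and `matches` function are hypothetical illustrations, not Roboflow data structures or API calls, and the height filters are omitted because they work like the width ones.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ImageRecord:
    """Hypothetical local record; not a Roboflow data structure."""
    filename: str
    width: int
    height: int
    split: str
    labels: List[str] = field(default_factory=list)  # one label per annotation


def matches(
    image: ImageRecord,
    min_width: Optional[int] = None,
    max_width: Optional[int] = None,
    min_annotations: Optional[int] = None,
    max_annotations: Optional[int] = None,
    classname: Optional[str] = None,
    split: Optional[str] = None,
) -> bool:
    """Return True when the image passes every filter that was supplied."""
    count = len(image.labels)
    # Bounds are inclusive here; that choice is an assumption.
    if min_width is not None and image.width < min_width:
        return False
    if max_width is not None and image.width > max_width:
        return False
    if min_annotations is not None and count < min_annotations:
        return False
    if max_annotations is not None and count > max_annotations:
        return False
    # classname keeps images that carry the label at least once.
    if classname is not None and classname not in image.labels:
        return False
    if split is not None and image.split != split:
        return False
    return True


images = [
    ImageRecord("a.jpg", 1920, 1080, "train", ["car", "car", "pedestrian"]),
    ImageRecord("b.jpg", 640, 480, "train", []),
    ImageRecord("c.jpg", 1280, 720, "valid", ["bike"]),
]

wide_train = [i.filename for i in images if matches(i, min_width=1000, split="train")]
unlabeled = [i.filename for i in images if matches(i, max_annotations=0)]
with_cars = [i.filename for i in images if matches(i, classname="car")]
print(wide_train, unlabeled, with_cars)
```

Combining several keyword arguments behaves like stacking filters: an image must satisfy all of them to be returned.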

Let's walk through a few examples you can use immediately to improve or understand your dataset.

Adding a New Class to Your Dataset

It's common to start training a model on a few objects and then deploy a fairly narrow model into production. Once a model is performing and adding value, you may want to expand the objects your model detects by adding an additional class.

Let's take a self-driving car dataset as an example. This dataset focuses on traffic lights, pedestrians, bikes, and cars. If we wanted to add a new class for stop signs without collecting an entirely new set of data to represent that class, we could search the current dataset to find and label the stop signs it already contains.

Finding unlabeled stop signs within a self-driving car dataset

Adding stop signs is a straightforward example because stop signs are obvious road-related objects. We can also search for less obvious objects such as crosswalks, mailboxes, overpasses, and gravel.

Identifying unique elements present within images

Using the same dataset to add new objects or classification types helps save time and make the most of your data. You'll be able to quickly expand how your model can be used whenever you find additional use cases based on a given dataset.

Improve Accuracy in Low Performing Environments

Once your model is deployed into production, edge cases can start to arise and you'll discover situations where your model is not performing at the level needed for a given task. Often, the best solution to increase performance in these scenarios is to increase the representation of those images in your dataset.

For models deployed in outdoor environments, weather can cause performance issues. Sourcing data in unusual conditions is also difficult because those conditions may not occur often. In this case, you can search for images related to those conditions rather than specific objects to see which ones are underrepresented. Let's use this excavator dataset as an example and look at the difference in representation of snow and nighttime images.

Finding images in settings with low confidence inferences or high error rates

We used weather and lighting as examples but you can apply this to other differences in images like aerial capture, wide lens, black and white, thermal imaging, and any other aspects that your model is not handling well.
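One simple way to turn these searches into a number is to compare how many images each query returns against the size of the dataset. The sketch below is only an illustration: the counts are invented placeholders rather than results from the excavator dataset, the 5% threshold is arbitrary, and both helper functions are hypothetical.

```python
def representation_share(match_counts, total_images):
    """Fraction of the dataset returned by each search query."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    return {query: count / total_images for query, count in match_counts.items()}


def underrepresented(shares, threshold):
    """Queries whose share of the dataset falls below the threshold."""
    return sorted(query for query, share in shares.items() if share < threshold)


# Placeholder counts for illustration only (not real search results).
match_counts = {"snow": 40, "night": 120, "daytime": 1500}
total_images = 2000

shares = representation_share(match_counts, total_images)
needs_more_data = underrepresented(shares, threshold=0.05)

for query, share in shares.items():
    print(f"{query}: {share:.1%}")
print("Collect more images for:", needs_more_data)
```

Because search results are ranked by similarity rather than strictly classified, counts like these are best read as rough estimates after reviewing the returned images.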

Not all datasets contain common objects, which can make it hard to search for specific data within images. Dataset Search is helpful here as well. Let's take an electronic components dataset to show how this can work and again see the difference in representation of images.

Using searches not related to objects to locate data

Describing the images you'd like to locate in your own words will return results that help you find them, even when common objects are not visible.

Find Missing Labels and Verify Label Quality

In situations where you're using a class for an object but are seeing low confidence with inferences, you may want to explore that object to see if the annotations could be improved, if objects are mislabeled, if the object is visible but not labeled, or if the data is not representative of the real-world scenario.

To highlight some of these examples, we can use a large dataset like COCO and see if the labels could be improved.

Finding labeling inconsistencies within a dataset

You can see that the graffiti toasters are labeled inconsistently, and you can also imagine that labeling a painted toaster as a toaster may make a model less performant on real toasters.

Discovering differences in labeling within a dataset

This search query highlights more examples of inconsistent labeling and of labeled data that may not be representative of what you want your model to predict. Action figures being labeled as people might be exactly what you want, and Dataset Search can help make sure those labels are consistent across images.

Search by Filename or Within a Split Group

Narrow your search to find specific data by using keywords in the filename (or the exact filename) and/or within a Split in your dataset.

Searching by filename and within a specific Split

Filename search is helpful if your file names contain information relevant to what you're identifying, such as dates, timestamps, locations, camera types, classes, or objects.
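The difference between an exact filename query and a partial one wrapped in * can be sketched locally with Python's standard `fnmatch` module. This is only an analogy for the behavior described in this post: `fnmatch` also gives special meaning to `?` and `[...]`, and the case-sensitive matching used here is an assumption. The file names and the `search_filenames` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names that encode a camera ID and a capture date.
filenames = [
    "cam1_2022-08-17_0001.jpg",
    "cam2_2022-08-17_0001.jpg",
    "cam1_2022-08-18_0007.jpg",
]


def search_filenames(names, query):
    """Return names matching the query; '*' matches any run of characters."""
    return [name for name in names if fnmatchcase(name, query)]


# Without wildcards, only a full match is returned.
exact = search_filenames(filenames, "cam1_2022-08-17_0001.jpg")

# A bare fragment is not a full match, so nothing is returned.
fragment = search_filenames(filenames, "2022-08-17")

# Wrapping the fragment in '*' turns it into a partial match.
partial = search_filenames(filenames, "*2022-08-17*")

print(exact)
print(fragment)
print(partial)
```

Encoding details such as the capture date or camera ID in file names at upload time is what makes this kind of partial match useful later.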

Better Understanding Datasets Using Search

Dataset Search is a powerful new way for you to gain insight into the data used to train your model. Use Dataset Search alongside Dataset Health Check to take a data-centric approach to improving the outcomes of your next computer vision project.

You can use Dataset Search today to explore the 90,000+ datasets in Roboflow Universe (go into any Project and click Images) or create a free account to upload your own dataset and explore your images via the Datasets tab.

Cite this Post

Use the following entry to cite this post in your research:

Trevor Lynn. (Aug 17, 2022). Launch: Dataset Search. Roboflow Blog: https://blog.roboflow.com/dataset-search/

Discuss this Post

If you have any questions about this blog post, start a discussion on the Roboflow Forum.
