As their projects mature and dataset sizes grow, most teams wrestle with label and class management. Slicing and dicing data is more of an art than a science and you will want to experiment with what works best for your problem over time (and, in fact, you will probably go through this process several times over the course of a project as you expand the scope and continue to redefine what "best" means).

Traditionally, this has been a tough nut to crack for computer vision because choosing class names is one of the first things you do, at the labeling step. Changing things later often means going back and re-labeling your images, which is time-consuming and expensive.

After working through this process, Roboflow has come up with some good rules of thumb to make iteration less painful and some tools to help you experiment more quickly.

Tips for Labeling Your Images

Roboflow recommends the following best practices for ontology management during the labeling phase:

  • Label the most specific classes you can think of. It's much easier to combine classes into broader categories than it is to split a class into more specific segments. For example, if you are creating a dataset for a cleaning robot, you'll want to create labels for dining table, coffee table, and end table instead of a single class for table. If you someday need your robot to behave differently for different table types, a single table class would force you to pay someone to go back and categorize each table, whereas specific classes can simply be merged with a tool like Roboflow.
  • Ensure all of your labelers are on the same page. For instance, if you have a category in your self-driving car dataset for semi-truck, does the whole vehicle including the trailer get labeled as a semi-truck or only the cab? Either way is valid, but it's important that they're all labeled the same way.
  • Ensure you have some null examples. When your model is deployed to the real world it won't always find the things it's looking for. If you're training a raccoon detector and only feed it photos of raccoons, it's going to learn that everything is a raccoon because it doesn't know that dogs, cats, and people exist.
  • But not too many null examples. Conversely, if you have too many null examples, your model may learn that its best strategy to minimize its loss function is to never predict anything!
  • Make sure you have examples with multiple classes present if that is a situation that may be encountered in the real world. For example, if you're collecting images of fish to survey a coral reef, you'll want to ensure that you have examples with multiple fish rather than a single fish in each image.
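
A quick audit can show whether a dataset follows the last three tips. The sketch below is only an illustration: it assumes annotations are held in a plain dictionary mapping each image filename to a list of box dictionaries with a "label" key, which is a hypothetical format and not a Roboflow export. The function name audit_annotations is likewise made up for this example.

```python
from collections import Counter


def audit_annotations(annotations):
    """Summarize class counts, null images, and multi-class images.

    `annotations` maps image name -> list of {"label": ...} dicts
    (a hypothetical in-memory format chosen for this sketch).
    """
    class_counts = Counter()
    null_images = 0
    multi_class_images = 0
    for boxes in annotations.values():
        labels = [box["label"] for box in boxes]
        class_counts.update(labels)
        if not labels:
            null_images += 1  # image with no annotations at all
        if len(set(labels)) > 1:
            multi_class_images += 1  # more than one distinct class present
    total = len(annotations)
    return {
        "class_counts": dict(class_counts),
        "null_fraction": null_images / total if total else 0.0,
        "multi_class_fraction": multi_class_images / total if total else 0.0,
    }


sample = {
    "img1.jpg": [{"label": "raccoon"}],
    "img2.jpg": [],
    "img3.jpg": [{"label": "raccoon"}, {"label": "dog"}],
    "img4.jpg": [{"label": "dog"}],
}
report = audit_annotations(sample)
print(report)
```

A very low or very high null fraction, or a multi-class fraction near zero, is a hint to revisit the tips above before training.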

Using Roboflow to Help

Luckily, Roboflow Pro has several built-in features to help with label management and ontology. Our advanced preprocessing and augmentation features can also help correct issues with too many or too few null images and ensure you have a broad mix of images with multiple classes present.

Fixing Typos and Naming Conflicts

Let's face it, sometimes our labelers make mistakes. Maybe you ended up with a few turck or Truck labels amidst a sea of trucks. Or maybe one of your annotators is British and calls a stroller a pram or a carriage. Or perhaps you only ended up with a few yellow lights and want to remove them altogether.

Traditionally, you'd have to go through and filter or correct these by hand. Roboflow makes it easy. Just add the "Modify Classes" preprocessing step and choose your settings.

Removing under-represented classes and fixing typos in a dataset.
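
For readers who want to see the idea outside the UI, here is a minimal sketch of renaming and dropping classes in plain Python. It is not Roboflow's implementation. The annotation format (image name mapped to a list of box dictionaries with a "label" key), the function name modify_classes, and the yellow-light class name are all assumptions made for illustration.

```python
def modify_classes(annotations, rename=None, drop=None):
    """Return a new annotation dict with labels renamed and/or dropped.

    `rename` maps old label -> new label; `drop` is a collection of labels
    to remove. Renaming is applied first, then dropping.
    """
    rename = rename or {}
    drop = set(drop or ())
    result = {}
    for image, boxes in annotations.items():
        kept = []
        for box in boxes:
            label = rename.get(box["label"], box["label"])
            if label in drop:
                continue  # remove boxes of unwanted classes
            kept.append({**box, "label": label})  # copy, never mutate input
        result[image] = kept
    return result


raw = {
    "a.jpg": [{"label": "turck"}, {"label": "truck"}],
    "b.jpg": [{"label": "Truck"}, {"label": "pram"}],
    "c.jpg": [{"label": "yellow-light"}, {"label": "carriage"}],
}
cleaned = modify_classes(
    raw,
    rename={"turck": "truck", "Truck": "truck",
            "pram": "stroller", "carriage": "stroller"},
    drop={"yellow-light"},
)
print(cleaned)
```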

Combining Classes

Oftentimes you will want to lump classes together into one category. For example, if you want your model to treat jet-ski, car, and boat all as a vehicle in your drone dataset, you can do that with the Modify Classes preprocessing step as well. Unlike doing this yourself, Roboflow's preprocessing steps are non-destructive, so you can always go back to your original ontology later if you decide you actually do need to distinguish between boats and jet-skis.
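
The same non-destructive idea can be sketched in a few lines of Python: keep the original labels untouched and derive a merged view from a description of the groups. This is an illustration under assumptions, not how Roboflow works internally; the merge_classes function and the dictionary-based annotation format are hypothetical.

```python
def merge_classes(annotations, groups):
    """Return a merged copy of `annotations`; the input is left unchanged.

    `groups` maps a broad class name to the specific classes it absorbs,
    e.g. {"vehicle": ["jet-ski", "car", "boat"]}.
    """
    to_group = {}
    for group_name, members in groups.items():
        for member in members:
            if member in to_group:
                raise ValueError(f"{member!r} is assigned to two groups")
            to_group[member] = group_name
    return {
        image: [
            {**box, "label": to_group.get(box["label"], box["label"])}
            for box in boxes
        ]
        for image, boxes in annotations.items()
    }


drone = {
    "d1.jpg": [{"label": "jet-ski"}, {"label": "person"}],
    "d2.jpg": [{"label": "car"}, {"label": "boat"}],
}
merged = merge_classes(drone, {"vehicle": ["jet-ski", "car", "boat"]})
print(merged)
print(drone)  # original ontology is still available
```

Because the merge is computed from the original labels each time, switching back to separate boat and jet-ski classes only requires changing the groups, not re-labeling.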

Splitting a Dataset or Extracting a Single Class

Sometimes you want to train a specific model on just one part of your dataset. For instance, if I have a coral reef dataset but only want to train a model on images of sharks from that dataset, I can turn off all the other classes and extract just the sharks.

Turning off all the other labels

I can then use the Filter Null preprocessing step to remove all of the images that are now unannotated because they don't contain a shark.

Removing all images that now have no labels because we omitted the bounding boxes for non-sharks.
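
As a rough sketch of these two steps combined, the snippet below keeps only the boxes of one class and then discards every image left without annotations. The annotation format (image name mapped to a list of box dictionaries with a "label" key) and the function name extract_class are assumptions for this example, not part of Roboflow.

```python
def extract_class(annotations, keep_label):
    """Keep only boxes labeled `keep_label`, then drop images left empty."""
    # Step 1: turn off every other class.
    only_kept = {
        image: [box for box in boxes if box["label"] == keep_label]
        for image, boxes in annotations.items()
    }
    # Step 2: filter out images that now have no annotations.
    return {image: boxes for image, boxes in only_kept.items() if boxes}


reef = {
    "reef1.jpg": [{"label": "shark"}, {"label": "clownfish"}],
    "reef2.jpg": [{"label": "clownfish"}],
    "reef3.jpg": [],
    "reef4.jpg": [{"label": "shark"}],
}
sharks = extract_class(reef, "shark")
print(sharks)
```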

Achieving an Appropriate Number of Null Images

Many datasets contain an overwhelming number of null images. For example, with aerial imagery it is common to tile your image into several smaller images, most of which may be empty grassland or forest.

You'll want your model to learn that there are some images without a subject present but you don't want it to err on the side of predicting null for every image. So you can let some of the null images through while filtering out most of them.

Choosing the optimal number of null images to allow is a bit of an art, and Roboflow makes it easy to experiment with multiple values.
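
To make the idea concrete, here is a minimal sketch that caps null images at a chosen share of the final dataset by randomly sampling which empty images to keep. The function name limit_null_images, the max_null_percent parameter, and the dictionary-based annotation format are all hypothetical; Roboflow's own setting may be defined differently.

```python
import random


def limit_null_images(annotations, max_null_percent, seed=0):
    """Keep all labeled images and at most `max_null_percent` null images.

    `max_null_percent` is an integer from 0 to 100 giving the largest share
    of the output that may consist of images without annotations.
    """
    if not 0 <= max_null_percent <= 100:
        raise ValueError("max_null_percent must be between 0 and 100")
    null_names = sorted(name for name, boxes in annotations.items() if not boxes)
    labeled_count = len(annotations) - len(null_names)
    if max_null_percent == 100:
        keep_count = len(null_names)
    else:
        # Largest k with k / (k + labeled_count) <= max_null_percent / 100.
        limit = (max_null_percent * labeled_count) // (100 - max_null_percent)
        keep_count = min(len(null_names), limit)
    kept = set(random.Random(seed).sample(null_names, keep_count))
    return {
        name: boxes
        for name, boxes in annotations.items()
        if boxes or name in kept
    }


tiles = {f"labeled_{i}.jpg": [{"label": "car"}] for i in range(4)}
tiles.update({f"empty_{i}.jpg": [] for i in range(10)})
limited = limit_null_images(tiles, max_null_percent=20, seed=0)
print(len(limited), sum(1 for boxes in limited.values() if not boxes))
```

Running this with a few different values of max_null_percent, and training on each result, is one way to experiment with the trade-off described above.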

Synthesizing Images with Multiple Subjects

Roboflow Pro includes an advanced augmentation called Mosaic. You can read more in the linked blog post, but essentially it combines multiple images from your dataset into a single one. This can help to mix and match subjects when you have only single-class or single-subject images.

Mosaic on a Fish dataset synthesizes images with multiple different types of fish.
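
The core of the idea can be sketched as tiling four images into a 2x2 grid and shifting each bounding box by its tile's offset. This is a deliberately simplified illustration and not Roboflow's Mosaic implementation, which may also rescale or crop the tiles. It assumes equally sized images stored as lists of pixel rows and boxes stored as (label, x_min, y_min, x_max, y_max) tuples; the function mosaic_2x2 is hypothetical.

```python
def mosaic_2x2(tiles):
    """Tile four equally sized images into a 2x2 grid and shift their boxes.

    Each tile is (pixels, boxes): `pixels` is a list of rows and `boxes` is a
    list of (label, x_min, y_min, x_max, y_max) tuples in pixel coordinates.
    Tile order is top-left, top-right, bottom-left, bottom-right.
    """
    if len(tiles) != 4:
        raise ValueError("expected exactly four tiles")
    height = len(tiles[0][0])
    width = len(tiles[0][0][0])
    for pixels, _ in tiles:
        if len(pixels) != height or any(len(row) != width for row in pixels):
            raise ValueError("all tiles must have the same size")

    # Join rows side by side for the top pair and the bottom pair.
    top = [left + right for left, right in zip(tiles[0][0], tiles[1][0])]
    bottom = [left + right for left, right in zip(tiles[2][0], tiles[3][0])]
    canvas = top + bottom

    # Shift every box by the position of its tile within the canvas.
    offsets = [(0, 0), (width, 0), (0, height), (width, height)]
    boxes = []
    for (dx, dy), (_, tile_boxes) in zip(offsets, tiles):
        for label, x0, y0, x1, y1 in tile_boxes:
            boxes.append((label, x0 + dx, y0 + dy, x1 + dx, y1 + dy))
    return canvas, boxes


def blank(value, width=4, height=2):
    """Build a tiny single-channel image filled with `value`."""
    return [[value] * width for _ in range(height)]


tiles = [
    (blank(0), [("shark", 0, 0, 2, 1)]),
    (blank(1), [("clownfish", 1, 0, 3, 2)]),
    (blank(2), []),
    (blank(3), [("tuna", 0, 1, 4, 2)]),
]
canvas, boxes = mosaic_2x2(tiles)
print(boxes)
```

Here three single-species images and one empty image become one synthetic image containing a shark, a clownfish, and a tuna.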

Wrapping it All Up

Every project is different, and you'll want to think deeply and experiment with what works best for yours. Roboflow will help streamline this experimentation so you can train better models faster. Get in touch; we'd love to walk you through how you can best integrate Roboflow Pro's ontology management and advanced augmentation features into your team's workflow.