Process and Mindset
Building a computer vision model is both an art and a science. While there are many pathways to creating a satisfactory model, some of these pathways contain pitfalls that can trip up experienced machine learning practitioners.
This guide is designed to help you dodge these pitfalls and ensure your model is ready for production as quickly as possible. Given the subjective nature of the decisions involved in this process, quick tests and frequent iteration are key to success. You don't want to spend ten hours labeling images before checking model performance.
Before we dive deep into our best practices for building and improving a computer vision model, here is a summary of what we will cover for quick reference:
- Remember the problem: Computer vision is just a tool to solve a real-world problem. Ensure you are evaluating success in terms of that problem, not accuracy scores like mAP.
- Keep it simple: Start with a small subset of data and the simplest version of your problem. This process allows for quick iteration - you can always add complexity as you gain confidence in your model.
- Focus on the data: Relevant data and the right labeling taxonomy will determine success, not a fancy model architecture or clever augmentations.
- Keep evaluating: Always check model performance in terms of the data your model will see in production - even after you’re in production!
Problem Selection and Scope
Before you spend time labeling images, you should first establish conviction that your problem can be solved through computer vision. A good rule of thumb: “if a human can detect an object or phenomenon in an image, so can an AI model”.
Once you’re confident that computer vision is an appropriate method to solve your problem, the single most important decision you can make is scoping. While it’s tempting to try solving for every edge case at first, that approach will get you stuck. Keep it simple.
Here are a few examples you can use to evaluate how you should think about defining the scope for your project:
- Are you trying to detect 200 different types of fruit? Start simple by only detecting bananas and apples.
- Are you trying to read individual digits in a license plate? Start by ensuring you can detect the license plate first.
- Do you want to detect basketball players from TV footage? Start with images from one game before you try to generalize.
- Do you want to build a model to detect different components on a circuit board? Starting with images annotated with 20 different components is counterproductive, because debugging your model will be more difficult.
Once you have built a model that solves a basic version of your problem, you can add new classes and work toward a complete solution.
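One way to put this scoping advice into practice is to filter an existing set of annotations down to a couple of starter classes. The sketch below assumes annotations are stored as a list of dictionaries with hypothetical `image` and `label` keys; your own export format will likely differ.

```python
from collections import defaultdict


def scope_annotations(annotations, starter_classes):
    """Keep only annotations whose label is a starter class, grouped by image."""
    keep = set(starter_classes)
    by_image = defaultdict(list)
    for ann in annotations:
        if ann["label"] in keep:
            by_image[ann["image"]].append(ann)
    return dict(by_image)


# Illustrative annotations (hypothetical file names and labels).
annotations = [
    {"image": "img1.jpg", "label": "banana"},
    {"image": "img1.jpg", "label": "kiwi"},
    {"image": "img2.jpg", "label": "apple"},
    {"image": "img3.jpg", "label": "mango"},
]

scoped = scope_annotations(annotations, ["banana", "apple"])
print(sorted(scoped))
```

Note that this sketch drops images that contain none of the starter classes. Depending on your problem, you may prefer to keep some of them as images with no labels.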
Collect High Quality Data
Experts can argue over what architecture and hyper-parameters will result in “state of the art” performance, but the reality is that data quality is the single largest determinant of model quality. Focus on relevance, variance, and sample size when curating your first dataset.
Let’s talk about each of these factors.
Relevance is the degree to which your training data matches the data your model will encounter in production. Here are a few examples of how to consider relevance in the context of solving specific problems with computer vision:
- If you eventually plan to deploy your model from a drone-mounted camera, you should only incorporate drone images.
- Do you expect your model to work outside, both during the day and night, year round? You will need data across all of the permutations of season, time, and weather.
- Did you grab a bunch of photos of hammers on a white background to create a model to be used in your workshop? Make sure to add photos actually taken in a workshop.
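To check relevance across conditions such as season, time, and weather, it can help to list every combination you expect in production and see which ones your dataset does not cover yet. This is a minimal sketch that assumes each image carries hypothetical `season`, `time`, and `weather` tags; those tag names are illustrative and not part of any particular tool.

```python
from itertools import product


def missing_conditions(image_tags, seasons, times, weathers):
    """Return the (season, time, weather) combinations with no images yet."""
    required = set(product(seasons, times, weathers))
    covered = {(t["season"], t["time"], t["weather"]) for t in image_tags}
    return sorted(required - covered)


# Illustrative tags for three images (hypothetical values).
image_tags = [
    {"season": "summer", "time": "day", "weather": "clear"},
    {"season": "summer", "time": "night", "weather": "clear"},
    {"season": "winter", "time": "day", "weather": "snow"},
]

gaps = missing_conditions(
    image_tags,
    seasons=["summer", "winter"],
    times=["day", "night"],
    weathers=["clear", "snow"],
)
for combo in gaps:
    print("no images yet for:", combo)
```

Not every combination will occur in practice, so treat the output as a prompt for where to collect more data rather than a strict requirement.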
Variance is the difference between the images in your dataset. The more varied the data your model will encounter in production, the more varied your dataset needs to be.
By adding variation to your training data, you will improve your model’s ability to generalize across different settings and contexts. If all of your images are extremely similar, you will likely end up with an overfit model. The model will perform well against similar images (and therefore have a high mAP score!), but it won’t generalize well outside of those images.
Importantly, your data needs to be relevant to what your model will encounter in production. If there is a high variance in the dataset with which you are working and the data is not representative of the environment in which your model will be deployed, your model will struggle to identify objects accurately.
Dataset size is the number of images in your dataset. Generally, more images will help improve model performance; each image gives the model more information about how to identify a class. However, the images must be varied and representative of production data. This is key.
Even though it’s generally better to have thousands and thousands of images, in most cases it’s a mistake to start with a dataset that large. Depending on your use case, you can often get surprisingly good results with only 50-150 well-selected and well-labeled images (more on labeling in the next section). These early models will help you understand model performance at the class-level, give you future checkpoints to train from, and enable label-assist as you label more images. As an added bonus, it is far easier to make labeling changes to smaller datasets - you want to do this iteration before you’ve invested hours and dollars labeling thousands of images!
You will want to make sure there is a balance between the different classes in your dataset. If you only have a few examples of one class in your dataset but hundreds of examples of another class, the model will perform less effectively at identifying the class with fewer examples. Make sure to look at Health Check to ensure you don’t have any classes that are severely under-represented.
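A quick way to spot class imbalance yourself is to count how often each class appears and flag any class below a chosen share of all labels. The sketch below is not how Health Check works internally; the `min_share` threshold of 10% is an arbitrary assumption you should tune for your dataset.

```python
from collections import Counter


def underrepresented_classes(labels, min_share=0.1):
    """Return {class: count} for classes below min_share of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n for cls, n in counts.items() if n / total < min_share}


# Illustrative label list: one entry per annotated object.
labels = ["apple"] * 180 + ["banana"] * 15 + ["orange"] * 105

flagged = underrepresented_classes(labels, min_share=0.1)
print(flagged)
```

Here `banana` makes up 15 of 300 labels (5%), so it is the only class flagged.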
Looking for more data to jumpstart your project? Try searching on Roboflow Universe.
Labeling
Creating computer vision models is both an art and a science. Labeling images is probably the part of the computer vision process that sits furthest toward the “art” side of the spectrum.
“Labeling” is really a set of two problems: the strategic (“how should I design my class taxonomy to best solve my problem?”) and the tactical (“how should I actually label each image to ensure I’m getting the best model performance”).
On the strategic side, here are a few tips to keep in mind.
First, plan for the future. Think about the classes you want to predict. Choose specific classes that you are likely to use. You can always remap ‘poodle’ and ‘corgi’ into the ‘dog’ class, but it’s going to be painful to later split the ‘dog’ class into its component classes.
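The asymmetry between merging and splitting classes is easy to see in code. Merging specific classes into a general one is a simple lookup, as in this minimal sketch (the `remap_labels` helper is illustrative, not a library function):

```python
def remap_labels(labels, mapping):
    """Replace each label using mapping; labels not in mapping are unchanged."""
    return [mapping.get(label, label) for label in labels]


fine_grained = ["poodle", "corgi", "cat", "poodle"]
coarse = remap_labels(fine_grained, {"poodle": "dog", "corgi": "dog"})
print(coarse)
```

Going the other way has no equivalent function: once every label says `dog`, the breed information is gone, and someone has to look at each image again to recover it.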
If you have groups of classes where a) classes within a group look very similar, and b) the groups differ significantly from one another, and you are seeing poor performance when training on every individual class, try a two-stage model. For example, if you’re trying to detect 20 models of bus and 20 models of sedan, first detect the group (‘bus’ or ‘sedan’), and then send that detection to a group-specific model.
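The routing logic of a two-stage model can be sketched as follows. The "models" here are hypothetical placeholder functions that take a file name and return a label; in a real pipeline they would be trained models, and the second stage would typically receive the region detected by the first stage.

```python
from typing import Callable, Dict, Tuple

# A "model" is any callable mapping an image to a label (placeholder type).
Model = Callable[[str], str]


def two_stage_predict(
    image: str, group_model: Model, specialists: Dict[str, Model]
) -> Tuple[str, str]:
    """Predict the coarse group first, then route to that group's model."""
    group = group_model(image)
    specific = specialists[group](image)
    return group, specific


def fake_group_model(image: str) -> str:
    # Stand-in for a trained 'bus' vs 'sedan' model.
    return "bus" if "bus" in image else "sedan"


# Stand-ins for trained group-specific models (hypothetical outputs).
specialists: Dict[str, Model] = {
    "bus": lambda image: "bus_model_a",
    "sedan": lambda image: "sedan_model_b",
}

result = two_stage_predict("bus_0001.jpg", fake_group_model, specialists)
print(result)
```

Each specialist only has to separate classes within its own group, which is the narrower problem the two-stage design is meant to create.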
If an object of interest is likely to be confused with an object that isn’t of interest, make sure each has a different class. The model will learn to distinguish between the two objects, improving performance.
On the tactical side, let’s talk through a few tips.
First, labels should be as tight as possible around the object of interest.
Second, make sure to incorporate ‘null’ images (images that contain none of your objects of interest) in your dataset.
Finally, train models often to see how Label Assist performs against new images. This will not only speed up your labeling process, but also show you early which classes your model performs worst on. Here’s a more detailed guide for those looking to improve labeling accuracy.
Preprocessing and Augmentations
Preprocessing ensures all of your data is in a standard format, a prerequisite for the training process. Augmentations, on the other hand, only impact the training data and are a way to synthetically increase your sample size (therefore improving the ‘mileage’ you get from your images).
When you’re starting work on a new model, you should only use the most basic preprocessing steps and ignore augmentations. This may seem unintuitive, but augmentations are most helpful for making a good model great, not for getting a first model working.
We don’t recommend starting with augmentations without understanding the baseline performance of your model on your data. Some preprocessing steps are always helpful; we recommend Auto-Orient (this fixes inconsistent orientation metadata in certain images) and resizing to a 416x416 square (computer vision models perform best on standardized squares).
If you find your model doesn’t perform well without augmentations, you know the problem lies in your source data. If you add augmentations from the start, you won’t know where the problem is: the data or the augmentations you have chosen. At this early stage, augmentations are more likely to hurt than help, for example by over-saturating or over-blurring images to the point that they’re unusable for training.
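There is more than one way to resize an image to the 416x416 square recommended above: you can stretch it, or you can scale it to fit and pad the remainder. The sketch below computes the dimensions for the fit-and-pad option as an illustration; it is an assumption about one reasonable approach, not a description of what any specific tool does by default.

```python
def fit_within_square(width, height, target=416):
    """Scale (width, height) to fit inside a target x target square.

    Returns (new_width, new_height, pad_x, pad_y), where pad_x and pad_y
    are the total padding needed to fill the square on each axis.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h, target - new_w, target - new_h


# A 1280x720 frame keeps its aspect ratio and is padded vertically.
dims = fit_within_square(1280, 720)
print(dims)
```

Whichever option you choose, apply the same one to training and production images so the model sees consistent inputs.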
When you get started, you have the option of using transfer learning and training from the COCO checkpoint or from no checkpoint. In almost all cases, the COCO checkpoint will result in substantially better performance (though it can be worth testing both options).
After you start to train models, it’s generally best to use your best-performing model as the checkpoint. Another reason to train early and often!
Testing and Evaluating Your Model
Evaluation comes last in this guide (as it’s the last step in training a model), but evaluation should be a part of every model iteration. Otherwise, there’s not much of a point in iterating. When you make a change, you should measure the impact and efficacy of the change.
You need to evaluate model performance in the context of the real-world problem you are trying to solve. To do this, use the Roboflow deploy tab to see how your model performs against other images. It’s helpful to start with images within the dataset, then similar images outside of the dataset, then images that might occur in production but are not well-represented in the dataset (edge cases).
We do not recommend relying solely on mAP for validation, although mAP is a useful metric. Relying on mAP in isolation presents problems, such as:
- In cases where you particularly care about 1-2 classes but not the others, mAP is not helpful as it combines information you care about with information you don’t care about.
- If you don’t have enough variance in your dataset, your model will likely be overfit. You’ll have a high mAP but the model will not work outside of the existing test set.
- If you only have 2-5 images in your validation set, small differences in images can change mAP considerably.
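The first point follows from how mAP is computed: it is the mean of the per-class average precision (AP) values, so a weak class can hide behind strong ones. The numbers below are made up purely for illustration, and `person_without_helmet` is a hypothetical class of interest.

```python
def mean_average_precision(ap_by_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)


# Illustrative per-class AP values (not real results).
ap_by_class = {
    "helmet": 0.95,
    "vest": 0.93,
    "gloves": 0.92,
    "person_without_helmet": 0.40,  # the class we actually care about
}

overall = mean_average_precision(ap_by_class)
print(f"mAP: {overall:.2f}")
print(f"class of interest AP: {ap_by_class['person_without_helmet']:.2f}")
```

An overall score of about 0.80 looks healthy, yet the one class that matters for the business problem sits at 0.40. Looking at per-class values alongside mAP avoids this blind spot.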
If you suspect that some classes perform better than others, try using the ‘remap classes’ preprocessing step to train models on each class of interest independently. This will give you a baseline sense of where your model needs the most improvement.
Once your model is working well, the best way to evaluate it is to put it into production using our hosted inference API. Run your model in parallel to an existing process, or without sending any output to the rest of your system. This will 1) produce more production data, 2) show you ‘true’ performance relative to your business case, and 3) potentially show that a model that is only ‘ok’ from an mAP perspective can still solve your problem.
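Running a model in parallel to an existing process can be sketched as a simple logging loop. In this minimal example, `existing_process` and `model` are hypothetical stand-in functions; in practice the first would be your current workflow and the second a call to your deployed model. Only the existing process drives the system, while the model's output is recorded for later review.

```python
def run_in_shadow(items, existing_process, model):
    """Record the model's output next to the existing process's decision."""
    log = []
    for item in items:
        decision = existing_process(item)  # this result is what the system uses
        prediction = model(item)           # this result is only logged
        log.append({
            "item": item,
            "existing": decision,
            "model": prediction,
            "agree": decision == prediction,
        })
    return log


def agreement_rate(log):
    """Fraction of items where the model matched the existing process."""
    if not log:
        return 0.0
    return sum(1 for entry in log if entry["agree"]) / len(log)


# Placeholder decision rules standing in for a real process and model.
def existing_process(value):
    return value >= 5


def model(value):
    return value >= 6


log = run_in_shadow([2, 5, 6, 9], existing_process, model)
print(agreement_rate(log))
print([entry["item"] for entry in log if not entry["agree"]])
```

The disagreements are the most useful output: each one is either a model error worth adding to your dataset or a case where the existing process was wrong.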
Building a computer vision model does not need to result in frustration or heartbreak. With a focus on the real-world problem you are trying to solve and an iterative mindset, you will make steady progress.