Evaluating Object Detection Models with mAP by Class
When evaluating an object detection model in computer vision, mean average precision (mAP) is the most commonly cited metric for assessing performance. Remember, mean average precision is a measure of our model's ability to correctly predict bounding boxes at some IoU threshold – commonly mAP@0.5 or mAP@0.5:0.95. The value after the @ is the intersection over union (IoU) threshold a predicted box must clear to count as a correct detection, where IoU measures how completely a predicted box overlaps the ground truth bounding box: if a predicted box perfectly overlaps its ground truth label, the IoU is 1.0.
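To make that concrete, here is a minimal sketch of the IoU calculation in plain Python (boxes are assumed to be in (x1, y1, x2, y2) pixel coordinates):

```python
def iou(box_a, box_b):
    """Compute intersection over union for two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Intersection area is zero when the boxes do not overlap at all
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    return intersection / (area_a + area_b - intersection)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # perfect overlap: 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # partial overlap: 0.333...
```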
Roboflow has previously broken down how to calculate mean average precision in depth. We've even recorded a video explainer:
When we consider a model's overall mAP on a problem with multiple object classes, we are averaging each class's individual average precision (AP) into a single value. This can hide performance issues if one of our classes, in particular, requires specific attention. Identifying the weakest predictions from our model creates opportunities for us to best improve it.
For example, imagine we're building an object detection model to identify individual chess pieces on a chess board using the chess pieces dataset. Determining which pieces our model performs worst on guides our future data collection: if our model struggles most to identify, say, black bishops, we should focus on collecting more images of the black bishop piece on the board.
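As a toy illustration of how a weak class can hide inside a healthy-looking average (the per-class AP values below are made up for illustration, not taken from a real model):

```python
# Hypothetical per-class average precision values at IoU 0.5
ap_by_class = {
    "white-pawn": 0.99,
    "black-pawn": 1.00,
    "black-bishop": 0.82,  # this class clearly needs attention...
}

# mAP is simply the mean of the per-class AP values
map_50 = sum(ap_by_class.values()) / len(ap_by_class)
print(f"mAP@0.5: {map_50:.3f}")  # ...yet it still averages out to 0.937
```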
Introducing mAP by Class in Roboflow Train
Roboflow Train automatically trains a model on your dataset – no GPU configuration or model architecture selection required.
Roboflow Train now also provides validation set and training set evaluation, including mAP by class, so you can identify exactly where to focus your efforts to improve your models.
Let's return to our chess pieces dataset problem. Using Roboflow Train, I've trained a model. I see the following evaluation metrics:
Overall, my model does quite well! It achieves 98.7% mAP across all classes. Let's dig into the individual classes in the validation and test sets.
One thing I can see immediately: my model does a great job with the pawn pieces. Both black and white pawns achieve 99% or 100% average precision in the validation and test sets.
On the other hand, a few other pieces show room for improvement. The white-rook piece achieves only 91% AP in the test set, the white-knight piece only 95% in the validation set, and the white-queen piece only 94% in the test set. Individually, these values are not so bad – but they are all below the overall mAP for the dataset.
Intuitively, this does make some sense: checking my Dataset Health Check, I see my pawn pieces have far more examples than any other piece:
I now have validation that my underrepresented classes do, indeed, need additional examples – though the most underrepresented classes are not necessarily the worst performers!
Using mAP by Class in Roboflow Train
Roboflow Train, by default, provides mAP by class evaluations. To use Roboflow Train on your dataset, simply click "Use Roboflow Train" on any version you create. (Roboflow Train is a part of any Roboflow Pro plan.)
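Once your model is trained, you can also query it programmatically through Roboflow's hosted inference API. Here is a minimal sketch using the roboflow Python package – the API key, project name, version number, and image path below are all placeholders to replace with your own:

```python
from roboflow import Roboflow  # pip install roboflow

# Placeholder credentials and identifiers – substitute your own
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace().project("chess-pieces")
model = project.version(1).model

# Run inference on a local image and inspect the predictions by class
prediction = model.predict("chess-board.jpg", confidence=40, overlap=30)
print(prediction.json())
```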
As always, happy training!