The computer vision community has converged on the metric mAP to compare the performance of object detection systems. In this post, we will dive into the intuition behind how mean Average Precision (mAP) is calculated and why mAP has become the preferred metric for object detection.
What is Object Detection?
Before we consider how to calculate a mean average precision, let's first clearly define the task it is measuring.
Object detection models seek to identify the presence of relevant objects in images and classify those objects into relevant classes. For example, in medical images, we might want to be able to count the number of red blood cells (RBC), white blood cells (WBC), and platelets in the bloodstream. In order to do this automatically, we need to train an object detection model to recognize each one of those objects and classify them correctly. (I did this in a Colab notebook to compare EfficientDet and YOLOv3, two state-of-the-art models for image detection.)
The models both predict bounding boxes which surround the cells in the picture. They then assign a class to each one of those boxes. For each assignment, the network models a sense of confidence in its prediction. You can see here that we have a total of three classes (RBC, WBC, and Platelets).
How should we decide which model is better? Looking at the image, it looks like EfficientDet (green) has drawn a few too many RBC boxes and missed some cells on the edge of the picture. That is certainly how it feels based on the looks of things - but can we trust an image and intuition? If so, by how much is it better? (Hint: it's not – skip to the bottom if you don't believe.)
It would be nice if we could directly quantify how each model does across images in our test set, across classes, and at different confidence thresholds. Enter mAP!
To understand mean average precision, we must spend some meaningful time with the precision-recall curve.
The Precision-Recall Curve
Precision is a measure of, "when your model guesses how often does it guess correctly?" Recall is a measure of "has your model guessed every time that it should have guessed?" Consider an image that has 10 red blood cells. A model that finds only one of these ten but correctly labels is as "RBC" has perfect precision (as every guess it makes – one – is correct) but imperfect recall (only one of ten RBC cells has been found).
Models that involve an element of confidence can tradeoff precision for recall by adjusting the level of confidence they need to make a prediction. In other words, if the model is in a situation where avoiding false positives (stating a RBC is present when the cell was a WBC) is more important than avoiding false negatives, it can set its confidence threshold higher to encourage the model to only produce high precision predictions at the expense of lowering its amount of coverage (recall).
The process of plotting the models precision and recall as a function of the model's confidence threshold is the precision recall curve. It is downward sloping because as confidence is decreased, more predictions are made (helping recall) and less precise predictions are made (hurting precision).
Think about it like this: if I said, "Name every type of shark," you'd start with obvious ones (high precision), but you'd become less confident with every additional type of shark you could name (approaching full recall with lesser precision). By the way, did you know there are cow sharks?
As the model is getting less confident, the curve is sloping downwards. If the model has an upward sloping precision and recall curve, the model likely has problems with its confidence estimation.
AI researchers love metrics and the whole precision-recall curve can be captured in single metrics. The first and most common is F1, which combines precision and recall measures to find the optimal confidence threshold where precision and recall produce the highest F1 value. Next, there is AUC (Area Under the Curve) which integrates the amount of the plot that falls underneath the precision and recall curve.
The final precision-recall curve metric is average precision (AP) and of most interest to us here. It is calculated as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
Both AUC and AP capture the whole shape of the precision recall curve. To choose one or the other for object detection is a matter of choice and the research community has converged on AP for interpretability.
Measuring Correctness via Intersection over Union
Object detection systems make predictions in terms of a bounding box and a class label.
In practice, the bounding boxes predicted in the X1, X2, Y1, Y2 coordinates are sure to be off (even if slightly) from the ground truth label. We know that we should count a bounding box prediction as incorrect if it is the wrong class, but where should we draw the line on bounding box overlap?
The Intersection over Union (IoU) provides a metric to set this boundary at, measured as the amount of predicted bounding box that overlaps with the ground truth bounding box divided by the total area of both bounding boxes.
Picking the right single threshold for the IoU metric seems arbitrary. One researcher might justify a 60 percent overlap, and another is convinced that 75 percent seems more reasonable. So why not have all of the thresholds considered in a single metric? Enter mAP.
Drawing mAP precision-recall curves
In order to calculate mAP, we draw a series of precision recall curves with the IoU threshold set at varying levels of difficulty.
In my sketch, red is drawn with the highest requirement for IoU (perhaps 90 percent) and the orange line is drawn with the most lenient requirement for IoU (perhaps 10 percent). The number of lines to draw is typically set by challenge. The COCO challenge, for example, sets ten different IoU thresholds starting at 0.5 and increasing to 0.95 in steps of .05.
Finally, we draw these precision-recall curves for the dataset split out by class type.
The metric calculates the average precision (AP) for each class individually across all of the IoU thresholds. Then the metric averages the mAP for all classes to arrive at the final estimate. 🤯
Using Mean Average Precision (mAP) in Practice
I recently used mAP in a post comparing state of the art detection models, EfficientDet and YOLOv3. I wanted to see which model did better on the tasks of identifying cells in the bloodstream and identifying chess pieces.
After I had run inference over each image in my test set, I imported a python package to calculate mAP in my Colab notebook. And here were the results!
Evaluation of EfficientDet on cell object detection:
78.59% = Platelets AP
77.87% = RBC AP
96.47% = WBC AP
mAP = 84.31%
Evaluation of YOLOv3 on cell object detection:
72.15% = Platelets AP
74.41% = RBC AP
95.54% = WBC AP
mAP = 80.70%
So contrary to the single inference picture at the beginning of this post, it turns out that EfficientDet did a better job of modeling cell object detection! You will also notice that the metric is broken out by object class. This tells us that WBC are much easier to detect than Platelets and RBC, which makes sense since they are much larger and distinct than the other cells.
mAP is also often broken out into small, medium, and large objects which helps identify where models (and/or datasets) may be going awry.
Now you know how to calculate mAP and more importantly, what it means! Thanks for reading and may your mean average precisions reach ever skyward 🚀