On June 25th, the first official version of YOLOv5 was released by Ultralytics. In this post, we will discuss the novel technologies deployed in the first YOLOv5 version and analyze preliminary performance results of the new model.
In the chart, the goal is to produce an object detector that is very performant (Y-axis) relative to its inference time (X-axis). Preliminary results show that YOLOv5 does exceedingly well to this end relative to other state-of-the-art techniques.
In summary, YOLOv5 derives most of its performance improvement from PyTorch training procedures, while the model architecture remains close to YOLOv4.
In this article we hope to address some of the following questions about YOLOv5:
- what's new in YOLO v5?
- what's the new technology in YOLO v5?
- how does YOLO v5 compare to YOLO v4?
- what's different between YOLO v4 and YOLO v5?
- should I use YOLO v4 or YOLO v5 for object detection?
- is YOLO v5 "real"?
A Development History of YOLOv5
History of YOLOs
For a deep dive into the history of YOLO models, I recommend reading this thorough breakdown of YOLOv4. In short, YOLO is a fast, compact object detection model that is very performant relative to its size, and it has been steadily improving.
Extension of YOLOv3 PyTorch
The YOLOv5 repository is a natural extension of the YOLOv3 PyTorch repository by Glenn Jocher. The YOLOv3 PyTorch repository was a popular destination for developers to port YOLOv3 Darknet weights to PyTorch and then move forward to production. Many (including our vision team at Roboflow) liked the ease of use of the PyTorch branch and would use this outlet for deployment.
After fully replicating the model architecture and training procedure of YOLOv3, Ultralytics began to make research improvements alongside a repository redesign. The goal was to empower thousands of developers to train and deploy their own custom object detectors to detect any object in the world, a goal we share here at Roboflow.
New YOLOv5 Repository Progress
These advancements were originally termed YOLOv4, but due to the recent release of YOLOv4 in the Darknet framework, they were renamed YOLOv5 to avoid version collisions. There was quite a bit of debate around the YOLOv5 naming in the beginning, and we published an article comparing YOLOv4 and YOLOv5, where you can run both models side by side on your own data. We abstain from custom dataset comparisons in this article and just discuss the new technologies and metrics that the YOLO researchers are publishing on GitHub discussions.
It is worth noting that, since the repository was published, significant research progress has occurred in YOLOv5. We expect this to continue, and it may give some justification to the "YOLO" moniker.
An Overview of the YOLO Architecture
An object detector is designed to create features from input images and then to feed these features through a prediction system to draw boxes around objects and predict their classes.
The YOLO model was the first object detector to connect the procedure of predicting bounding boxes with class labels in an end-to-end differentiable network.
The YOLO network consists of three main pieces.
1) Backbone - A convolutional neural network that aggregates and forms image features at different granularities.
2) Neck - A series of layers to mix and combine image features to pass them forward to prediction.
3) Head - Consumes features from the neck and takes box and class prediction steps.
Of course, there are many approaches one can take to combining different architectures at each major component. The contributions of YOLOv4 and YOLOv5 are foremost to integrate breakthroughs in other areas of computer vision and prove that as a collection, they improve YOLO object detection.
An Overview of YOLO Training Procedures
Training procedures are equally important to the end performance of an object detection system, though they are often less discussed.
Data Augmentation - Data augmentation makes transformations to the base training data to expose the model to a wider range of semantic variation than the training set in isolation.
Loss Calculations - YOLO calculates a total loss function from constituent loss functions - GIoU, obj, and class losses. These can be carefully constructed to maximize the objective.
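To make the loss terms a little more concrete, here is a minimal sketch of the GIoU score between two boxes and of a weighted sum of the three constituent losses. It assumes boxes in (x1, y1, x2, y2) corner format with positive area. The names `giou` and `total_loss`, and the equal default weights, are illustrative choices for this post and are not taken from the YOLOv5 code.

```python
def giou(box_a, box_b):
    """Generalized IoU for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle (zero area when the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Area of the smallest box that encloses both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # GIoU subtracts the share of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def total_loss(box_loss, obj_loss, cls_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss terms (weights are hypothetical)."""
    w_box, w_obj, w_cls = weights
    return w_box * box_loss + w_obj * obj_loss + w_cls * cls_loss


# A box loss can be formed as 1 - GIoU, so perfect overlap costs nothing.
box_loss = 1.0 - giou((0, 0, 2, 2), (0, 0, 2, 2))
```

Unlike plain IoU, the GIoU score stays informative for boxes that do not overlap at all, because it goes negative as the boxes move further apart.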
The largest contribution of YOLOv5 is to translate the Darknet research framework to the PyTorch framework. The Darknet framework is written primarily in C and offers fine-grained control over the operations encoded into the network. In many ways the control of the lower-level language is a boon to research, but it can make it slower to port in new research insights, as one writes custom gradient calculations with each new addition.
The process of translating (and exceeding) the training procedures in Darknet to PyTorch in YOLOv3 is no small feat.
Data Augmentation in YOLOv5
Here is a picture of augmented training images in YOLOv5.
With each training batch, YOLOv5 passes training data through a data loader, which augments data online. The data loader makes three kinds of augmentations: scaling, color space adjustments, and mosaic augmentation. The most novel of these is mosaic data augmentation, which combines four images into four tiles of random ratio.
The mosaic data loader is native to the YOLOv3 PyTorch and now YOLOv5 repo.
Mosaic augmentation is especially useful for the popular COCO object detection benchmark, helping the model learn to address the well-known "small object problem" - where small objects are not as accurately detected as larger objects.
It is worth experimenting with your own series of augmentations to maximize performance on your custom task.
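As a rough illustration of the "four tiles of random ratio" idea, the sketch below only computes where the four tiles would sit on a square canvas. It is an assumption-laden simplification: the function name `mosaic_layout` and the choice to keep the split point in the middle half of the canvas are ours, and a real data loader would also resize or crop each image into its tile and shift the box labels accordingly.

```python
import random


def mosaic_layout(canvas_size, rng=random):
    """Return four (x1, y1, x2, y2) tiles that partition a square canvas."""
    s = canvas_size
    # Random split point, kept away from the borders so every tile has area.
    xc = rng.randint(s // 4, 3 * s // 4)
    yc = rng.randint(s // 4, 3 * s // 4)
    return [
        (0, 0, xc, yc),    # top-left tile
        (xc, 0, s, yc),    # top-right tile
        (0, yc, xc, s),    # bottom-left tile
        (xc, yc, s, s),    # bottom-right tile
    ]


tiles = mosaic_layout(640, random.Random(0))
```

Because the split point moves every time, objects end up at different scales and positions from batch to batch, which is where the benefit for small objects is thought to come from.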
Auto Learning Bounding Box Anchors
In order to make box predictions the YOLO network predicts bounding boxes as deviations from a list of anchor box dimensions.
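Here is a small sketch of what "deviations from an anchor" can look like in practice. It uses the parameterization from the YOLOv3 paper (a sigmoid offset inside the grid cell and an exponential scale on the anchor size). We are assuming that form for illustration only, and the exact transform in the YOLOv5 code may differ. The function name `decode_box` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw, anchor, cell, stride):
    """Turn raw network outputs into a box (center x, center y, width, height).

    raw:    (tx, ty, tw, th) predicted deviations
    anchor: (pw, ph) anchor width and height in pixels
    cell:   (cx, cy) grid cell index that made the prediction
    stride: pixels per grid cell at this detection scale
    """
    tx, ty, tw, th = raw
    pw, ph = anchor
    cx, cy = cell
    bx = (sigmoid(tx) + cx) * stride  # center stays inside the cell
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw)            # anchor size is scaled, not replaced
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# All-zero deviations return the anchor itself, centered in the cell.
box = decode_box((0.0, 0.0, 0.0, 0.0), anchor=(10, 13), cell=(2, 3), stride=8)
```

The point of the design is that the network only has to learn small corrections, which is easier when the anchors already resemble the boxes in the dataset.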
In the YOLOv3 PyTorch repo, Glenn Jocher introduced the idea of learning anchor boxes based on the distribution of bounding boxes in the custom dataset with K-means and genetic learning algorithms. This is very important for custom tasks, because the distribution of bounding box sizes and locations may be dramatically different than the preset bounding box anchors in the COCO dataset.
The most extreme difference in anchor boxes may occur if we are trying to detect something like giraffes that are very tall and skinny or manta rays that are very wide and flat.
All YOLO anchor boxes are auto-learned in YOLOv5 when you input your custom data.
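The K-means part of this idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the repository's implementation: it clusters (width, height) pairs using IoU as the similarity measure and leaves out the genetic refinement step entirely. The function names and the tiny giraffe and manta ray style dataset are made up for the example.

```python
import random


def wh_iou(a, b):
    """IoU of two boxes that share a corner, so only width/height matter."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors, assigning by highest IoU."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break  # assignments have stopped changing
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Tall, skinny boxes and wide, flat boxes (widths and heights in pixels).
boxes = [(10, 100), (12, 110), (11, 90), (100, 10), (110, 12), (90, 11)]
anchors = kmeans_anchors(boxes, k=2)
```

On this toy data the two learned anchors separate into one tall shape and one wide shape, which is exactly the kind of dataset-specific prior that preset COCO anchors would miss.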
16 Bit Floating Point Precision
The PyTorch framework makes it possible to halve the floating point precision used in training and inference, from 32-bit to 16-bit. This significantly speeds up the inference time of the YOLOv5 models.
However, these speed improvements are only available on select GPUs at this point - namely, V100 and T4. That said, NVIDIA has written about its intent to improve coverage of this efficiency boost.
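To get a feel for what halving the precision means, the sketch below uses only Python's standard library to round-trip a number through the IEEE 754 16-bit format. It does not involve PyTorch or a GPU and says nothing about speed. It only illustrates the storage and rounding trade-off. The helper name `to_half_and_back` is ours.

```python
import struct


def to_half_and_back(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


bytes_fp32 = struct.calcsize("<f")  # storage for one 32-bit float
bytes_fp16 = struct.calcsize("<e")  # storage for one 16-bit float

exact = to_half_and_back(0.5)       # powers of two survive unchanged
rounded = to_half_and_back(0.1)     # most decimals pick up a small error
```

Each value takes half the memory, at the cost of a small rounding error, which is generally tolerable for inference on hardware that accelerates 16-bit arithmetic.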
New Model Configuration Files
YOLOv5 formulates model configuration in .yaml, as opposed to .cfg files in Darknet. The main difference between these two formats is that the .yaml file is condensed to just specify the different layers in the network and then multiplies those by the number of layers in the block. The new .yaml format looks like the following:
# parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple

# anchors
anchors:
  - [116,90, 156,198, 373,326]  # P5/32
  - [30,61, 62,45, 59,119]  # P4/16
  - [10,13, 16,30, 33,23]  # P3/8

# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, BottleneckCSP, ],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, BottleneckCSP, ],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, BottleneckCSP, ],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
  ]

# YOLOv5 head
head:
  [[-1, 3, BottleneckCSP, [1024, False]],  # 9

   [-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, ],  # cat backbone P4
   [-1, 3, BottleneckCSP, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, ],  # cat backbone P3
   [-1, 3, BottleneckCSP, [256, False]],
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]],  # 18 (P3/8-small)

   [-2, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, ],  # cat head P4
   [-1, 3, BottleneckCSP, [512, False]],
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]],  # 22 (P4/16-medium)

   [-2, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, ],  # cat head P5
   [-1, 3, BottleneckCSP, [1024, False]],
   [-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]],  # 26 (P5/32-large)

   [, 1, Detect, [nc, anchors]],  # Detect(P5, P4, P3)
  ]
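The two multiples at the top of the file are what let one layout describe several model sizes. The sketch below shows, as we understand it, how the "number" column and the channel arguments are scaled when the model is built, and how the `na * (nc + 5)` expression in the head evaluates. Treat the rounding rules (round to at least one repeat, round channels up to a multiple of 8) as our reading of the repository at the time of writing, not a guaranteed specification. The function names are ours.

```python
import math


def scale_depth(number, depth_multiple):
    """Scale how many times a block repeats; single layers stay single."""
    return max(round(number * depth_multiple), 1) if number > 1 else number


def scale_width(channels, width_multiple, divisor=8):
    """Scale a channel count and round up to a multiple of `divisor`."""
    return math.ceil(channels * width_multiple / divisor) * divisor


depth_multiple, width_multiple = 0.33, 0.50  # values from the config above
nc = 80                                      # number of classes
anchors_p3 = [10, 13, 16, 30, 33, 23]        # one anchor row from the config

repeats = [scale_depth(n, depth_multiple) for n in (1, 3, 9)]
channels = [scale_width(c, width_multiple) for c in (64, 128, 256, 512, 1024)]

na = len(anchors_p3) // 2        # anchors per scale: three (w, h) pairs
head_outputs = na * (nc + 5)     # per anchor: 4 box values + 1 objectness + nc
```

With these multiples, a block listed with 9 repeats is built with 3, a 1024-channel layer is built with 512, and each detection layer outputs 255 channels for the 80-class COCO setup.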
Both YOLOv4 and YOLOv5 implement the CSP Bottleneck to formulate image features - with the research credit directed to WongKinYiu and their recent paper on Cross Stage Partial Networks for convolutional neural network backbones. The CSP addresses duplicate gradient problems in other larger ConvNet backbones, resulting in fewer parameters and fewer FLOPS for comparable performance. This is extremely important to the YOLO family, where inference speed and small model size are of utmost importance.
The CSP models are based on DenseNet. DenseNet was designed to connect layers in convolutional neural networks with the following motivations: to alleviate the vanishing gradient problem (it is hard to backprop loss signals through a very deep network), to bolster feature propagation, encourage the network to reuse features, and reduce the number of network parameters.
In CSPResNext50 and CSPDarknet53, the DenseNet has been edited to separate the feature map of the base layer by copying it and sending one copy through the dense block and sending another straight on to the next stage. The idea with the CSPResNext50 and CSPDarknet53 is to remove computational bottlenecks in the DenseNet and improve learning by passing on an unedited version of the feature map.
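The core of the cross stage partial idea fits in a toy example. The sketch below treats a feature map as a plain list of channels, sends one part through a stand-in "block", and passes the other part on untouched before joining them again. It partitions the channels in two, as the CSPNet paper describes. Real implementations work on tensors and add convolutions around both paths, so this only illustrates the data flow. All names here are hypothetical.

```python
def csp_stage(channels, block):
    """Route half of the channels around `block` and half through it."""
    half = len(channels) // 2
    shortcut = channels[:half]          # passed on unedited
    processed = block(channels[half:])  # goes through the expensive block
    return shortcut + processed         # concatenated for the next stage


def toy_block(part):
    """Stand-in for a dense or bottleneck block."""
    return [value * 10 for value in part]


features = csp_stage([1, 2, 3, 4], toy_block)
```

Only half of the channels pay the cost of the block, which hints at why the approach saves parameters and computation.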
Both YOLOv4 and YOLOv5 implement the PA-NET neck for feature aggregation.
Each one of the P_i above represents a feature layer in the CSP backbone.
The above picture comes from research done by Google Brain on the EfficientDet object detection architecture. The EfficientDet authors found BiFPN to be the best choice for the detection neck, and it may be an area of further study for YOLOv4 and YOLOv5 to explore other implementations here.
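To make the P_i notation a bit more tangible: the configuration file comments pair P3, P4, and P5 with strides of 8, 16, and 32 pixels. The sketch below computes the grid size of each feature layer for a square input. The 640 pixel input size is an assumption for the example, and the helper name is ours.

```python
def feature_grid_sizes(image_size, strides=(8, 16, 32)):
    """Map each pyramid level name to its grid size for a square image."""
    sizes = {}
    for stride in strides:
        level = stride.bit_length() - 1  # stride 8 = 2**3, so level P3
        sizes[f"P{level}"] = image_size // stride
    return sizes


grids = feature_grid_sizes(640)
```

The finest grid (P3) has the most cells and is the one best suited to small objects, while the coarsest grid (P5) sees the largest context per cell.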
It is certainly worth noting here that YOLOv5 borrows research inquiry from YOLOv4 to decide on the best neck for their architecture. YOLOv4 investigated various possibilities for the best YOLO neck including:
General Quality of Life Updates for Developers
Compared to other object detection frameworks, YOLOv5 is extremely easy to use for a developer implementing computer vision technologies into an application. I categorize these quality of life updates into the following.
- Easy Install - YOLOv5 only requires the installation of torch and some lightweight Python libraries.
- Fast Training - The YOLOv5 models train extremely quickly which helps cut down on experimentation costs as you build your model.
- Inference Ports that work - You can infer with YOLOv5 on individual images, batch images, video feeds, or webcam ports.
- Intuitive Layout - File folder layout is intuitive and easy to navigate while developing.
- Easy Translation to Mobile - You can easily translate YOLOv5 from PyTorch weights to ONNX weights to CoreML to iOS.
Preliminary Evaluation Metrics
The evaluation metrics presented in this section are preliminary and we can expect a formal research paper to be published on YOLOv5 when the research work is complete and more novel contributions have been made to the family of YOLO models. That said, it is useful to provide these metrics for a developer who is considering which framework to use today, before the research papers have been published.
The evaluation metrics below are based on performance on the COCO dataset which contains a wide range of images containing 80 object classes. For more detail on the performance metric, see this post on what is mAP.
The official YOLOv4 paper publishes the following evaluation metrics running their trained network on the COCO dataset on a V100 GPU:
With the initial release of the first YOLOv5 V1 model, the YOLOv5 repository published the following:
The two graphs use inverse X-axes (FPS versus ms/img), but we can invert the YOLOv5 axis to estimate roughly 200-300 FPS on the same V100 GPU, while achieving higher mAP.
It is also important to note here the new release of YOLOv4-tiny, a very small and very performant model in the Darknet repository.
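The axis inversion is simple arithmetic: frames per second is 1000 divided by the milliseconds spent per image. A quick sketch, with helper names of our own choosing:

```python
def ms_per_image_to_fps(ms):
    """Convert latency in milliseconds per image to frames per second."""
    return 1000.0 / ms


def fps_to_ms_per_image(fps):
    """Convert frames per second to latency in milliseconds per image."""
    return 1000.0 / fps


slow_end_ms = fps_to_ms_per_image(200)  # latency at the 200 FPS estimate
fast_end_ms = fps_to_ms_per_image(300)  # latency at the 300 FPS estimate
```

So the 200-300 FPS estimate corresponds to roughly 3.3 to 5 ms per image on the ms/img axis.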
The evaluation metrics for YOLOv4-tiny read:
This means it is very fast and very performant. The important thing to notice here, however, is that the evaluation metric is AP_50, the average precision at an IoU threshold of 50%. With this more lenient metric in mind, we must compare against the full table for YOLOv5:
There we can see that YOLOv5s (a similar model in speed and model size) achieves 55.8 AP_50.
The comparison is a little more complicated here because the YOLOv4-tiny model is evaluated on a 1080Ti, which is at most 2x slower than the V100 used in the YOLOv5 table.
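For readers less familiar with the metric, the "50" refers to how much a predicted box must overlap a ground truth box before it counts as a hit. The sketch below shows only that matching criterion, assuming boxes in (x1, y1, x2, y2) format with positive area. Computing the full average precision also involves class labels, confidence ranking, and a precision-recall curve, which we leave out. The function names are ours.

```python
def iou(box_a, box_b):
    """Intersection over union for two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def counts_at_50(prediction, truth, threshold=0.5):
    """True when the overlap is high enough to count under AP_50."""
    return iou(prediction, truth) >= threshold


loose = counts_at_50((0, 0, 2, 2), (0, 0, 2, 1.5))  # IoU 0.75
miss = counts_at_50((0, 0, 2, 2), (1, 0, 3, 2))     # IoU about 0.33
```

A stricter metric would raise the threshold or average over several thresholds, which is why AP_50 numbers should only be compared with other AP_50 numbers.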
Needless to say, there will be more narrowly matched benchmarks to come and some are underway in this GitHub issue. WongKinYiu, author of the CSP repo above and second author of YOLOv4, provides comparable benchmarks.
From this point of view, YOLOv4 emerges as the superior architecture. It is worth noting, however, that in this comparison YOLOv4 is trained in the Ultralytics YOLOv3 repository (not the native Darknet), including most of the training enhancements in the YOLOv5 repository, and shows mAP improvements.
More to come here I am sure.
Stepping back, it is a great time to be working on computer vision, where the state of the art is advancing so rapidly.
The initial release of YOLOv5 is very fast, performant, and easy to use. While YOLOv5 has yet to introduce novel model architecture improvements to the family of YOLO models, it introduces a new PyTorch training and deployment framework that improves the state of the art for object detectors. Furthermore, YOLOv5 is very user friendly and comes ready to use on custom objects "out of the box".
If you are interested in using either of the state of the art YOLO models to train a custom detector, we encourage you to view either of the following two guides in Google Colab:
Good luck detecting!