YOLO (You Only Look Once) is a family of computer vision models that has gained significant attention since Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduced the novel architecture at CVPR in 2016, even winning OpenCV's People's Choice Award.

OpenCV CEO Satya Mallick and Roboflow CEO Joseph Nelson discuss all the architectures in the YOLO family of models.

But what exactly is YOLO (besides an outdated Drake song)? Where did it come from, why is it novel, and why do there seem to be so many versions?

The Origins of YOLO: You Only Look Once

The original YOLO (You Only Look Once) was written by Joseph Redmon in a custom framework called Darknet. Darknet is a very flexible research framework written in low-level languages that has produced a series of the best real-time object detectors in computer vision: YOLO, YOLOv2, YOLOv3, and now YOLOv4.

The original YOLO model was the first object detection network to combine the problem of drawing bounding boxes and identifying class labels into one end-to-end differentiable network.

Some object detection models treat the detection pass as a two-part problem. First, identify the regions of interest (bounding boxes) where an object is present. Second, classify that specific region of interest. This is considered a two-stage detector, and popular models like Faster R-CNN leverage this approach.

YOLO, on the other hand, is a single-stage detector, handling both object localization and classification in a single pass of the network. YOLO is not the only single-stage detection model (MobileNetSSDv2, for example, is another popular single-shot detector), but it is generally more performant in terms of speed and accuracy.

By treating detection as a single-shot regression problem for identifying bounding boxes, YOLO models are often very fast and very small, which makes them faster to train and easier to deploy, especially to compute-limited edge devices.
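To make the single-pass idea concrete, here is a minimal sketch (in PyTorch) of how a YOLOv1-style head packs every box and class prediction into one output tensor. The grid size S=7, boxes per cell B=2, and C=20 classes follow the original paper; the random tensor and the 0.5 threshold are just stand-ins for a real forward pass and a real tuning choice.

```python
import torch

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes
prediction = torch.rand(S, S, B * 5 + C)  # stand-in for one forward pass

detections = []
for row in range(S):
    for col in range(S):
        cell = prediction[row, col]
        class_probs = cell[B * 5:]        # one class distribution per cell
        for b in range(B):
            x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
            score = conf * class_probs.max()  # box confidence * best class
            if score > 0.5:               # keep confident boxes, NMS later
                detections.append((row, col, x, y, w, h, score))
```

Localization and classification alike come out of that single tensor in one pass; there is no separate region-proposal stage.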

YOLOv2 and YOLOv3

The YOLO family of models has continued to evolve since the initial release in 2016.

Notably, YOLOv2 and YOLOv3 are both by Joseph Redmon. YOLO models after YOLOv3 were written by new authors and, rather than being considered strictly sequential releases to YOLOv3, have varying goals based on the authors who released them.

YOLOv2 was released in 2017, earning a best paper honorable mention at CVPR 2017. The architecture made a number of iterative improvements on top of YOLO, including batch normalization, higher-resolution input, and anchor boxes.

YOLOv3 was released in 2018. (The YOLOv3 paper is perhaps one of the most readable papers in computer vision research, given its colloquial tone.) YOLOv3 built upon previous models by adding an objectness score to bounding box prediction, adding connections to the backbone network layers, and making predictions at three separate levels of granularity to improve performance on smaller objects.
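To give a sense of what "three levels of granularity" means in practice, the illustrative shapes below assume a 416x416 input, three anchors per cell, and the 80 COCO classes, matching the configuration described in the YOLOv3 paper:

```python
import torch

num_anchors, num_classes = 3, 80
# Strides 32, 16, and 8 give 13x13, 26x26, and 52x52 grids: the coarse grid
# catches large objects, the fine grid catches small ones. Each anchor carries
# (x, y, w, h, objectness) plus one score per class.
heads = [
    torch.rand(1, num_anchors * (5 + num_classes), size, size)
    for size in (13, 26, 52)
]
for head in heads:
    print(head.shape)  # e.g. torch.Size([1, 255, 13, 13])
```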

YOLOv4, YOLOv5, and So Much More

After the release of YOLOv3, Joseph Redmon stepped away from computer vision research. Researchers like Alexey Bochkovskiy and innovators like Glenn Jocher began to open-source their advancements in computer vision research. Groups like Baidu also released their own implementations of YOLO (like PP-YOLO), demonstrating an improvement in mAP and a decrease in latency.

YOLOv4, released April 2020 by Alexey Bochkovskiy, became the first paper in the "YOLO family" to not be authored by Joseph Redmon. YOLOv4 introduced improvements like better feature aggregation, a "bag of freebies" (with augmentations), Mish activation, and more. (See a detailed breakdown of YOLOv4.)

YOLOv5, released June 2020 by Glenn Jocher, is the first model in the "YOLO family" to not be released with an accompanying paper, and it remains under "ongoing development" in its repo. Glenn had been maintaining a version of YOLOv3 implemented in PyTorch, but as he continued to make improvements to the architecture itself, he ultimately decided to release a new repo branded as YOLOv5.
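Because YOLOv5 ships as a PyTorch repo rather than a paper, trying it takes only a few lines via PyTorch Hub. A minimal sketch, where 'yolov5s' is the small pretrained checkpoint and the image URL is just an example input:

```python
import torch

# Pull the small pretrained YOLOv5 checkpoint from PyTorch Hub.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference on an example image and print labels with confidences.
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()
```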

Glenn joined us on Roboflow YouTube to discuss a bit about what's new in YOLOv5.

PP-YOLO, released in August 2020 by Baidu, surpasses YOLOv4's performance metrics on the COCO dataset. The "PP" stands for "PaddlePaddle," Baidu's neural network framework (akin to Google's TensorFlow). PP-YOLO achieves its improved performance through changes like a replaced model backbone, DropBlock regularization, Matrix NMS, and more. (See a detailed breakdown of PP-YOLO.)

Scaled YOLOv4, released in November 2020 by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, takes advantage of Cross Stage Partial networks to scale up the size of the network while maintaining both the accuracy and speed of YOLOv4. Notably, Scaled YOLOv4 makes use of the training framework released with YOLOv5. (See a detailed breakdown of Scaled YOLOv4.)

PP-YOLOv2, again authored by the Baidu team, was released in April 2021. PP-YOLOv2 made minor tweaks to PP-YOLO to achieve improved performance, including adding the Mish activation function and a Path Aggregation Network (sensing a trend in improvements flowing from one network to the next?). (See a detailed breakdown of PP-YOLOv2.)

To YOLO Infinity, and Beyond

Models (loosely) based on the original YOLO paper in 2016 continue to blossom in the computer vision space. We'll continue to keep you up to date on how to train YOLO models, architecture improvements, and more.

Continue subscribing to our newsletter, and happy training!

(Did we miss any? Let us know! We'll update this post as the field evolves.)