YOLO (You Only Look Once) is a family of computer vision models that has gained significant fanfare since Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduced the novel architecture in 2016 at CVPR, even winning OpenCV's People's Choice Award.

OpenCV CEO Satya Mallick and Roboflow CEO Joseph Nelson discuss all the architectures in the YOLO family of models.

But what exactly is YOLO (besides an outdated Drake song)? Where did it come from, why is it novel, and why do there seem to be so many versions?

The Origins of YOLO: You Only Look Once

The original YOLO (You Only Look Once) was written by Joseph Redmon in a custom framework called Darknet. Darknet is a flexible research framework written in C and CUDA, and it produced several of the best real-time object detectors in computer vision: YOLO, YOLOv2, YOLOv3, and YOLOv4. Later models in the family, including YOLOv5, YOLOv6, and YOLOv7, are implemented in PyTorch.

The YOLO family of models has continued to evolve since the initial release in 2016. Notably, YOLOv2 and YOLOv3 are both by Joseph Redmon. YOLO models after YOLOv3 are written by new authors and, rather than being strictly sequential successors to YOLOv3, have varying goals depending on the authors who released them.

The original YOLO model was the first object detection network to combine the problem of drawing bounding boxes and identifying class labels in one end-to-end differentiable network.

Some object detection models treat the detection pass as a two-part problem. First, identify the regions of interest (bounding boxes) where an object is present. Second, classify each region of interest. This is considered a two-stage detector, and popular models like Faster R-CNN use this approach.

The YOLO Algorithm and Architecture

YOLO is a single-stage detector, handling both object localization and classification in a single pass of the network. YOLO is not the only single-stage detection model (e.g. MobileNetSSDv2 is another popular single-shot detector), but it is generally more performant in terms of speed and accuracy.

Because they treat detection as a single-shot regression problem for identifying bounding boxes, YOLO models are often very fast and very small, which tends to make them faster to train and easier to deploy, especially on compute-limited edge devices.
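To make the single-pass idea concrete, here is a minimal sketch of how the original 2016 YOLO paper lays out its output. The image is divided into an S × S grid, and each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities, all in one tensor. The defaults S=7, B=2, C=20 match the paper's PASCAL VOC setup; the function names are our own illustrations, not part of Darknet or any library.

```python
def yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of the single prediction tensor in the original YOLO layout."""
    # Each box carries 5 numbers: x, y, w, h, and a confidence score.
    depth = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, depth)


def responsible_cell(x_center, y_center, grid_size=7):
    """Return (row, col) of the grid cell that contains a box center.

    x_center and y_center are normalized to [0, 1] relative to the image.
    """
    # Clamp so that a center exactly on the right/bottom edge stays in range.
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col


print(yolo_output_shape())          # (7, 7, 30)
print(responsible_cell(0.5, 0.25))  # (1, 3)
```

Boxes and class scores come out of the same forward pass, which is what distinguishes this layout from a two-stage pipeline that proposes regions first and classifies them afterwards.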

What is YOLO used for?

The YOLO algorithm is used for real-time object detection. Before YOLO, R-CNNs were among the most common methods of detecting objects, but they were slow and not as useful for real-time applications. YOLO provides the speed necessary for use cases that require fast inference, such as detecting cars, identifying animals, and monitoring for safety violations.

Here are a few more situations where YOLO might be useful:

  1. Identifying intruders in a factory.
  2. Monitoring vehicle movement on a building site.
  3. Understanding traffic patterns on a road (e.g. finding the times during which a road is most and least used).
  4. Identifying smoke from fires in the wild.
  5. Monitoring to ensure workers wear the right PPE in certain environments (e.g. while using tools or working with chemicals that emit hazardous fumes).

The use cases for YOLO are vast. If you need to set up a camera to detect objects in real-time, YOLO is always worth considering.

What is YOLOv2?

YOLOv2 was released in 2017, earning a Best Paper Honorable Mention at CVPR 2017. The architecture made a number of iterative improvements on top of YOLO, including batch normalization, higher-resolution inputs, and anchor boxes.
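As a rough illustration of what anchor boxes change, the sketch below decodes one prediction the way the YOLOv2 paper describes: the network outputs offsets (t_x, t_y, t_w, t_h), the center is squashed with a sigmoid so it stays inside its grid cell, and the width and height rescale a predefined anchor. Function and parameter names are ours, and the anchor size in the example is made up for illustration.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_anchor_box(t_x, t_y, t_w, t_h, cell_col, cell_row, anchor_w, anchor_h):
    """Turn raw network outputs into a box, measured in grid-cell units."""
    # The sigmoid keeps the predicted center inside its own grid cell.
    center_x = cell_col + sigmoid(t_x)
    center_y = cell_row + sigmoid(t_y)
    # Width and height rescale the anchor instead of being predicted from scratch.
    width = anchor_w * math.exp(t_w)
    height = anchor_h * math.exp(t_h)
    return center_x, center_y, width, height


# Zero offsets put the center mid-cell and leave the (hypothetical) anchor size unchanged.
print(decode_anchor_box(0.0, 0.0, 0.0, 0.0,
                        cell_col=3, cell_row=1, anchor_w=2.0, anchor_h=4.0))
# (3.5, 1.5, 2.0, 4.0)
```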

YOLOv3: The next version

YOLOv3 was released in 2018. (The YOLOv3 paper is perhaps one of the most readable papers in computer vision research given its colloquial tone.)

YOLOv3 built upon previous models by adding an objectness score to bounding box prediction, adding connections to the backbone network layers, and making predictions at three separate levels of granularity to improve performance on smaller objects.
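The three levels of granularity are easier to picture with numbers. The sketch below assumes a 416 × 416 input (a common choice, not the only one), the strides of 32, 16, and 8 used in the YOLOv3 paper, and three anchors per grid cell at each scale. The function name is ours.

```python
def predictions_per_scale(image_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Count candidate boxes produced at each detection scale."""
    counts = {}
    for stride in strides:
        grid = image_size // stride  # e.g. 416 // 32 = 13 cells per side
        counts[stride] = grid * grid * anchors_per_cell
    return counts


counts = predictions_per_scale()
print(counts)                # {32: 507, 16: 2028, 8: 8112}
print(sum(counts.values()))  # 10647
```

Most of the candidate boxes come from the finest grid (stride 8), which is where small objects are most likely to be picked up.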

The newest YOLO models: A deep dive

After the release of YOLOv3, Joseph Redmon stepped away from computer vision research. Researchers like Alexey Bochkovskiy and innovators like Glenn Jocher began to open source their advancements in computer vision research. Groups like Baidu also released their own implementations of YOLO (like PP-YOLO), demonstrating an improvement in mAP and a decrease in latency.

Let's talk through each of the newest YOLO models one by one.

YOLOv4

YOLOv4, released in April 2020 by Alexey Bochkovskiy, became the first paper in the "YOLO family" not authored by Joseph Redmon. YOLOv4 introduced better feature aggregation, a "bag of freebies" (including augmentations), the mish activation, and more. (See a detailed breakdown of YOLOv4.)
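For reference, mish is defined as x · tanh(softplus(x)). The scalar sketch below is only meant to show the shape of the function; real training code would use a framework's tensor implementation rather than the standard library.

```python
import math


def softplus(x):
    # log(1 + e^x), rearranged so large |x| does not overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: smooth, and slightly negative for negative inputs."""
    return x * math.tanh(softplus(x))


print(round(mish(1.0), 4))   # 0.8651
print(round(mish(-1.0), 4))  # -0.3034
```

Unlike ReLU, mish lets small negative values through instead of clipping them to zero.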

YOLOv5

YOLOv5, released in June 2020 by Glenn Jocher, is the first model in the "YOLO family" to be released without an accompanying paper, and it remains under "ongoing development" in its repo. Glenn had been maintaining a version of YOLOv3 implemented in PyTorch, but as he continued to make improvements to the architecture itself, he ultimately decided to release a new repo branded as YOLOv5.

The video below is an interview with the Roboflow CEO Joseph Nelson and the developer of YOLOv5, Glenn Jocher. The interview is full of insights on what's new in the YOLOv5 model.

Glenn joined us on Roboflow YouTube to discuss a bit about what's new in YOLOv5.

PP-YOLO

PP-YOLO, released in August 2020 by Baidu, surpasses YOLOv4's performance metrics on the COCO dataset. The "PP" stands for "PaddlePaddle," Baidu's neural network framework (akin to Google's TensorFlow). PP-YOLO reports improved performance through changes such as a replaced model backbone, DropBlock regularization, Matrix NMS, and more. (See a detailed breakdown of PP-YOLO.)

Scaled YOLOv4

Scaled YOLOv4 was released in November 2020 by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. The model uses Cross Stage Partial networks to scale up the size of the network while maintaining both the accuracy and speed of YOLOv4. Notably, Scaled YOLOv4 takes advantage of the training framework released in YOLOv5. (See a detailed breakdown of Scaled YOLOv4.)

PP-YOLOv2

PP-YOLOv2, again authored by the Baidu team, was released in April 2021. PP-YOLOv2 made minor tweaks to PP-YOLO to achieve improved performance, including adding the mish activation function and Path Aggregation Network (sensing a trend in improvements flowing from one network to the next?).

For more information, check out our detailed breakdown of PP-YOLOv2.

YOLOv5

YOLOv5 only requires the installation of torch and some lightweight Python libraries. The YOLOv5 models train extremely quickly, which helps cut down on experimentation costs as you build your model. You can run inference with YOLOv5 on individual images, batches of images, video feeds, or webcam ports, and easily convert YOLOv5 from PyTorch weights to ONNX weights to CoreML for deployment on iOS.

We go into a lot more depth about YOLOv5 in our YOLOv5 deep dive.

YOLOv6

YOLOv6 iterates on the YOLO backbone and neck by redesigning them with the hardware in mind, introducing what the authors call the EfficientRep Backbone and Rep-PAN Neck.

In YOLOv6, the head is decoupled, meaning the network has additional layers that separate the classification and box regression features before the final outputs, which has empirically been shown to increase performance. In addition to architectural changes, the YOLOv6 repository also implements some enhancements to the training pipeline, including anchor-free (not NMS-free) training, SimOTA label assignment, and SIoU box regression loss.
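Since the model is anchor-free but not NMS-free, overlapping detections still need to be filtered after inference. The sketch below is a generic greedy non-maximum suppression in plain Python, included only to show what that step does. It is not the YOLOv6 repository's implementation, and the 0.5 threshold and sample boxes are arbitrary.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for index in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[index], boxes[k]) <= iou_threshold for k in kept):
            kept.append(index)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed; the third box does not overlap anything and survives.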

See our detailed breakdown of YOLOv6 for more information.

YOLOv7

In YOLOv7, the authors build on prior research into efficient layer aggregation, keeping in mind both the amount of memory it takes to hold layers in memory and the distance a gradient has to back-propagate through the layers: the shorter the gradient path, the more powerfully the network will be able to learn. The final layer aggregation they choose is E-ELAN, an extended version of the ELAN computational block.

See our detailed breakdown of YOLOv7 to learn more.

Why should I use a YOLO model?

YOLO models are popular for a variety of reasons. First, YOLO models are fast. This enables you to use YOLO models to process video feeds at a high frames-per-second rate. This is useful if you are planning to run live inference on a video camera to track something that changes quickly (e.g. the position of a ball on a football field, or the position of a package on a conveyor belt).
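A quick way to reason about "fast enough" is the per-frame time budget: at a given frame rate, the model has to finish each frame before the next one arrives. The sketch below is simple arithmetic, and the 25 ms inference time is a made-up number for illustration, not a benchmark of any YOLO model.

```python
def frame_budget_ms(frames_per_second):
    """Milliseconds available to process one frame at a given frame rate."""
    return 1000.0 / frames_per_second


def keeps_up(inference_ms, frames_per_second):
    """True if per-frame inference fits inside the frame budget."""
    return inference_ms <= frame_budget_ms(frames_per_second)


print(round(frame_budget_ms(30), 1))  # 33.3
print(keeps_up(25.0, 30))             # True
print(keeps_up(25.0, 60))             # False
```

In practice you would also budget for decoding, pre-processing, and post-processing, so the real headroom is smaller than this suggests.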

Second, YOLO models continue to lead the way in terms of accuracy. The YOLOv7 model, the latest in the family of models as of November 2022, has state-of-the-art performance when measured against the MS COCO object detection dataset. While there are other great models out there, YOLO has a strong reputation for its accuracy.

Third, the YOLO family of models is open source, which has created a vast community of YOLO enthusiasts who are actively working with and discussing these models. This means that there is no shortage of information on the web about the how and why behind these models, and getting help is not difficult if you reach out to the computer vision community.

Taken together, these three factors make it clear why so many engineers use YOLO to power their computer vision applications.

To (YOLO) Infinity, and Beyond

Models (loosely) based on the original YOLO paper in 2016 continue to blossom in the computer vision space. We'll continue to keep you up to date on how to train YOLO models, architecture improvements, and more.

Continue subscribing to our newsletter, and happy training!

(Did we miss any? Let us know! We'll update this post as the field evolves.)