Realtime object detection advances with the release of YOLOv7, the latest iteration in the life cycle of YOLO models. YOLOv7 infers faster and with greater accuracy than its previous versions (i.e. YOLOv5), pushing the state of the art in object detection to new heights.

The evaluation of YOLOv7 models show that they infer faster (x-axis) and with greater accuracy (y-axis) than comparable realtime object detection models.

In this post, we break down the internals of how YOLOv7 works and the novel research involved in its construction. We'll then get on to a video about how you can train a model using YOLOv7 with a custom dataset and Roboflow.

Jump to Training YOLOv7

If you are more interested in the practical side of YOLOv7, you can skip this breakdown post and jump straight to our YOLOv7 Tutorial. You can train a YOLOv7 model on your custom data in minutes.

What is YOLOv7?

The YOLO (You Only Look Once) v7 model is the latest in the family of YOLO models. YOLO models are single stage object detectors. In a YOLO model, image frames are featurized through a backbone. These features are combined and mixed in the neck, and then they are passed along to the head of the network YOLO predicts the locations and classes of objects around which bounding boxes should be drawn.

YOLO conducts a post-processing via non-maximum supression (NMS) to arrive at its final prediction.

YOLO network architecture as depicted in PP-YOLO

In the beginning, YOLO models were used widely by the computer vision and machine learning communities for modeling object detection because they were small, nimble, and trainable on a single GPU. This is the opposite of the giant transformer architectures coming out of the leading labs in big tech which, while  effective, are more difficult to run on consumer hardware.

Since their introduction in 2015, YOLO models have continued to proliferate in the industry. The small architecture allows new ML engineers to get up to speed quickly when learning about YOLO and the realtime inference speed allows practitioners to allocate minimal hardware compute to power their applications.

YOLOv7 detecting horses with bounding box predictions

The YOLOv7 Authors

The YOLO v7 model was authored by WongKinYiu and Alexey Bochkovskiy  (AlexeyAB).

AlexeyAB took up the YOLO torch from the original author, Joseph Redmon, when Redmon quit the Computer Vision (CV) industry due to ethical concerns. AlexeyAB maintained his fork of YOLOv3 for a while before releasing YOLOv4, an upgrade on the previous model.

WongKinYiu entered the CV research stage with a Cross Stage Partial networks, which allowed YOLOv4 and YOLOv5 to build more efficient backbones. From there, WongKinYiu steamrolled ahead making a large contribution to the YOLO family of research with Scaled-YOLOv4. This model introduces efficient scaling of YOLOv4 CSPs to hit a COCO state of the art mAP.

Scaled-YOLOv4 was the first paper AlexeyAB and WongKinYiu collaborated on, and in doing so they ported YOLOv5 PyTorch implementations over to their line of repositories. This was a more convenient place to research than their traditional YOLOv4 Darknet repository, which already had a defined scope.

After that, WongKinYiu released YOLOR, which introduced new methods of tracking implicit knowledge in tandem with explicit knowledge in neural networks. The video below is a walk given by Alexey Bochkovskiy on real-time object detection with three models in the YOLO family:

WongKinYiu and AlexeyAB are back with the introduction of YOLOv7, quickly following the release of YOLOv6 from the Meituan Technical Team.

What Makes YOLOv7 Different?

The YOLOv7 authors sought to set the state of the art in object detection by creating a network architecture that would predict bounding boxes more accurately than its peers at similar inference speeds.

YOLOv7 evaluates in the upper left - faster and more accurate than its peer networks

In order to achieve these results, the YOLOv7 authors made a number of changes to the YOLO network and training routines. Below, we're going to talk about three notable contributions to the field of computer vision research that were made in the YOLOv7 paper.

Extended Efficient Layer Aggregation

The efficiency of the YOLO networks convolutional layers in the backbone is essential to efficient inference speed. WongKinYiu started down the path of maximal layer efficiency with Cross Stage Partial Networks.

In YOLOv7, the authors build on research that has happened on this topic, keeping in mind the amount of memory it takes to keep layers in memory along with the distance that it takes a gradient to back-propagate through the layers. The shorter the gradient, the more powerfully their network will be able to learn. The final layer aggregation they choose is E-ELAN, an extend version of the ELAN computational block.

The evolution of layer aggregation strategies in YOLOv7

Model Scaling Techniques

Object detection models are typically released in a series of models, scaling up and down in size, because different applications require different levels of accuracy and inference speeds.

Typically, object detection models consider the depth of the network, the width of the network, and the resolution that the network is trained on. In YOLOv7 the authors scale the network depth and width in concert while concatenating layers together. Ablation studies show that this technique keep the model architecture optimal while scaling for different sizes.

Compound scaling in YOLOv7 model sizes

Re-parameterization Planning

Re-parameterization techniques involve averaging a set of model weights to create a model that is more robust to general patterns that it is trying to model. In research, there has been a recent focus on module level re-parameterization where piece of the network have their own re-parameterization strategies.

The YOLOv7 authors use gradient flow propagation paths to see which modules in the network should use re-parameterization strategies and which should not.

Re-parameterization go/no-go findings in YOLOv7

Auxiliary Head Coarse-to-Fine

The YOLO network head makes the final predictions for the network, but since it is so far downstream in the network, it can be advantageous to add an auxiliary head to the network that lies somewhere in the middle. While you are training, you are supervising this detection head as well as the head that is actually going to make predictions.

The auxiliary head does not train as efficiently as the final head because there is less network between it an the prediction - so the YOLOv7 authors experiment with different levels of supervision for this head, settling on a coarse-to-fine definition where supervision is passed back from the lead head at different granularities.

Coarse-to-fine auxiliary head supervision in the YOLOv7 network

For more detail on the research contributions in YOLOv7, check out the YOLOv7 research paper.

The YOLOv7 Codebase

The YOLOv7 GitHub repository contains all of the code you need to get started training YOLOv7 on your custom data.

The network is defined in PyTorch . Training scripts, data loaders, and utility scripts are written in Python.

python train.py --workers 8 --device 0 --batch-size 32 --data data/coco.yaml --img 640 640 --cfg cfg/training/yolov7.yaml --weights '' --name yolov7 --hyp data/hyp.scratch.p5.yaml 

The repository is evolution of the YOLOR and Scaled-YOLOv4 repositories, which is derived from WongKinYiu's YOLOv4-PyTorch repository, which arguably gathers a lot of its backbones from the original YOLOV3-PyTorch and YOLOv5 repositories by Ultralytics.

While initial benchmarks on the COCO benchmark are strong, we will see where the YOLOv7 repository evolves relative to the well-maintained YOLOv5 repository.

The YOLOv7 authors left us with an exciting teaser to watch - the unification of pose and instance segmentation tasks within YOLO, a long anticipated evolution.

Instance segmentation and pose estimation teasers in YOLOv7

YOLOv7 Next Steps

The YOLOv7 models here were all trained to detect the generic 80 classes in the COCO dataset. To use YOLOv7 for your own application, watch our guide on how to  train YOLOv7 on your own custom dataset:

Video guide for training YOLOv7 in Colab

To read about other recent contributions in the field of object detection, check out our breakdown of YOLOv6, which dives deep into the architecture

If you are using your object detection models in production, look to Roboflow for setting up a machine learning operations pipeline around your model lifecycles and deployment schemas.

Frequently Asked Questions

What are the versions of YOLOv7?

YOLOv7 has multiple versions. YOLOv7 is the main version of the model. YOLOv7-tiny is a smaller model that is optimised for inference on edge devices. There is also YOLOv7-W6 which is commonly used in cloud computing.

What is YOLO used for?

YOLO is used to implement real-time object detection algorithms. For instance, you could use YOLO to detect cars in a video, or people walking on a street. One of the most prominent applications for YOLO is autonomous driving applications.