Realtime object detection advances with the release of YOLOv7, the latest iteration in the life cycle of YOLO models.

YOLOv7 infers faster and with greater accuracy than its cohorts, pushing the state of the art in object detection to new heights.

The evaluation of YOLOv7 models show that they infer faster (x-axis) and with greater accuracy (y-axis) than comparable realtime object detection models.

In this post, we break down the internals of how YOLOv7 works and the novel research involved in its construction.

Jump to Training YOLOv7

Skip this breakdown post and jump straight to our YOLOv7 Tutorial. You can train a YOLOv7 model on your custom data in minutes.

The YOLO Model Architecture

The YOLO (You Only Look Once) model is a single stage object detector. Image frames are featurized through a backbone, features are combined and mixed in the neck, and then they are passed along to the head of the network where YOLO predicts the bounding boxes locations, the classes of the bounding boxes, and the objectness of the bounding boxes. YOLO uses post-processing via NMS to arrive at its final prediction.

YOLO network architecture as depicted in PP-YOLO

In the beginning, YOLO models were used widely by the community for modeling object detection because they were small, nimble, and trainable on a single GPU (as opposed to giant transformer architectures coming out of the leading labs in big tech).

Since their introduction in 2015, YOLO models have continued to proliferate in the industry - the small architecture allows new ML engineers to get up to speed quickly when learning about YOLO and the realtime inference speed allows practitioners to allocate minimal hardware compute to power their applications.

YOLOv7 detecting horses with bounding box predictions

The YOLOv7 Authors

WongKinYiu & AlexeyAB

AlexeyAB took up the YOLO torch from the original author. Joseph Redmon, when Redmon quit the CV industry due to ethical concerns. AlexeyAB maintained his fork of YOLOv3 for a while before releasing YOLOv4.

WongKinYiu entered the CV research stage with a Cross Stage Partial networks, which allowed YOLOv4 and YOLOv5 to build more efficient backbones. From there, WongKinYiu steamrolled ahead making a large contribution to the YOLO family of research with Scaled-YOLOv4, an efficient scaling of YOLOv4 CSPs to hit a COCO state of the art mAP.

Scaled-YOLOv4 was the first paper AlexeyAB and WongKinYiu collaborated on, and in doing so they ported YOLOv5 PyTorch implementations over to their line of repositories, a more convenient place to research than their traditional YOLOv4 Darknet repository.

After that, WongKinYiu released YOLOR, which introduced new methods of tracking implicit knowledge in tandem with explicit knowledge in neural networks.

WongKinYiu and AlexeyAB are back with the introduction of YOLOv7, quickly following the release of YOLOv6 from the Meituan Technical Team.

YOLOv7 Research Contributions

The YOLOv7 authors sought to set the state of the art in object detection by creating a network architecture that would predict bounding boxes more accurately than its peers at similar inference speeds.

YOLOv7 evaluates in the upper left - faster and more accurate than its peer networks

In order to achieve these results, the YOLOv7 authors make a number of changes to the YOLO network and training routines.

Extended Efficient Layer Aggregation

The efficiency of the YOLO networks convolutional layers in the backbone is essential to efficient inference speed. WongKinYiu started down the path of maximal layer efficiency with Cross Stage Partial Networks.

In YOLOv7, the authors build on research that has happened on this topic keeping in mind the amount of memory it takes to keep layers in memory along with the distance that it takes a gradient to back-propagate through the layers - the shorter the gradient, the more powerfully their network will be able to learn. The final layer aggregation they choose is E-ELAN, an extend version of the ELAN computational block.

The evolution of layer aggregation strategies in YOLOv7

Model Scaling Techniques

Object detection models are typically released in a series of models, scaling up and down in size, because different applications require different levels of accuracy and inference speeds.

Typically, object detection models consider the depth of the network, the width of the network, and the resolution that the network is trained on. In YOLOv7 the authors scale the network depth and width in concert while concatenating layers together. Ablation studies show that this technique keep the model architecture optimal while scaling for different sizes.

Compound scaling in YOLOv7 model sizes

Re-parameterization Planning

Re-parameterization techniques involve averaging a set of model weights to create a model that is more robust to general patterns that it is trying to model. In research, there has been a recent focus on module level re-parameterization where piece of the network have their own re-parameterization strategies.

The YOLOv7 authors use gradient flow propagation paths to see which modules in the network should use re-parameterization strategies and which should not.

Re-parameterization go/no-go findings in YOLOv7

Auxiliary Head Coarse-to-Fine

The YOLO network head makes the final predictions for the network, but since it is so far downstream in the network, it can be advantageous to add an auxiliary head to the network that lies somewhere in the middle. While you are training, you are supervising this detection head as well as the head that is actually going to make predictions.

The auxiliary head does not train as efficiently as the final head because there is less network between it an the prediction - so the YOLOv7 authors experiment with different levels of supervision for this head, settling on a coarse-to-fine definition where supervision is passed back from the lead head at different granularities.

Coarse-to-fine auxiliary head supervision in the YOLOv7 network

For more detail on the research contributions in YOLOv7, check out the YOLOv7 research paper.

The YOLOv7 Codebase

The YOLOv7 GitHub repository contains all of the code you need to get started training YOLOv7 on your custom data.

The network is defined in PyTorch and training scripts, data loaders, and utility scripts are written in PyThon.

python --workers 8 --device 0 --batch-size 32 --data data/coco.yaml --img 640 640 --cfg cfg/training/yolov7.yaml --weights '' --name yolov7 --hyp data/hyp.scratch.p5.yaml 

The repository is evolution of the YOLOR and Scaled-YOLOv4 repositories, which is derived from WongKinYiu's YOLOv4-PyTorch repository, which arguably gathers a lot of its backbones from the original YOLOV3-PyTorch and YOLOv5 repositories by Ultralytics.

While initial benchmarks on the COCO benchmark are strong, we will see where the YOLOv7 repository evolves relative to the well-maintained YOLOv5 repository.

The YOLOv7 authors left us with an exciting teaser to watch - the unification of pose and instance segmentation tasks within YOLO, a long anticipated evolution.

Instance segmentation and pose estimation teasers in YOLOv7

YOLOv7 Next Steps

The YOLOv7 models here were all trained to detect the generic 80 classes in the COCO dataset. To use YOLOv7 for your own application, see here to train YOLOv7 on your own custom dataset.

To read about other recent contributions in the field of object detection, check out this breakdown of YOLOv6.

If you are using your object detection models in production, look to Roboflow for setting up a machine learning operations pipeline around your model lifecycles and deployment schemas.