YOLOS (You Only Look at One Sequence) is the newest, and potentially most impactful, addition to the YOLO family of object detection models.

Diagram of the YOLOS network architecture

In this post, we will unpack YOLOS.

YOLO Backbone Until This Point

In all other YOLO models, the backbone that creates features from images is some variation of a convolutional neural network (CNN). While the YOLO models differ on exactly how this CNN is formed, they all build the neck of the object detection model on top of the CNN backbone.

Best YOLO diagram I have seen, courtesy of the PP-YOLO paper

The Ascent of Transformers

Transformers, introduced in the famous paper "Attention Is All You Need", have totally "transformed" the world of NLP. They are theorized to be even more broad ranging, with the ability to model many kinds of mathematical transformations between data and predictions.

In the last year, transformers have broadened their scope to the world of computer vision, setting new standards in image classification with ViT, the first Vision Transformer. The Vision Transformer treats patches of image pixels as a sequence, much like the sequences of text tokens that we are familiar with from models like GPT and BERT.
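To make the "patches as a sequence" idea concrete, here is a minimal sketch in NumPy. The function name `image_to_patch_sequence`, the 224x224 image size, and the 16x16 patch size are illustrative assumptions, not values taken from this post. A real Vision Transformer would also project each flattened patch through a learned linear layer and add position embeddings, which this sketch leaves out.

```python
import numpy as np

def image_to_patch_sequence(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Separate the grid position of each patch from the pixels inside it:
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in raster order: (rows * cols, patch * patch * C)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

# Illustrative input: a blank 224x224 RGB image cut into 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
sequence = image_to_patch_sequence(image, patch_size=16)
print(sequence.shape)  # (196, 768)
```

Each of the 196 rows plays the role that a token plays in a text model: the transformer attends over the rows of this array rather than over words.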

The Vision Transformer for image classification 

YOLOS Transformer Architecture

Unlike previous CNN-based YOLO models, the YOLOS backbone is a Transformer, much like the first Vision Transformer for classification.

The YOLOS model architecture

In addition to the Vision Transformer, YOLOS has a detector portion of the network. Learned detection tokens are processed alongside the image patches, and the detector maps the resulting sequence of detection representations to class and box predictions.
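The mapping from detection representations to predictions can be sketched as follows. This is a simplified illustration under stated assumptions, not the YOLOS implementation: the encoder output is faked with random numbers, each head is a single linear layer where a real model would likely use a small MLP, and the sizes (196 patches, 100 detection tokens, width 192, 80 classes plus a "no object" slot) are illustrative. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from this post).
NUM_PATCHES = 196     # e.g. a 224x224 image cut into 16x16 patches
NUM_DET_TOKENS = 100  # detection tokens appended after the patch tokens
HIDDEN_DIM = 192      # width of each token representation
NUM_CLASSES = 80      # object classes; one extra logit stands for "no object"

def detection_head(encoder_output, num_det_tokens, w_cls, b_cls, w_box, b_box):
    """Map the detection-token outputs to class logits and normalized boxes."""
    # Only the trailing detection tokens are used for prediction.
    det = encoder_output[-num_det_tokens:]
    class_logits = det @ w_cls + b_cls
    # A sigmoid keeps box values (cx, cy, w, h) in the range (0, 1).
    boxes = 1.0 / (1.0 + np.exp(-(det @ w_box + b_box)))
    return class_logits, boxes

# Stand-in for the transformer output: one vector per patch and detection token.
encoder_output = rng.standard_normal((NUM_PATCHES + NUM_DET_TOKENS, HIDDEN_DIM))

# Randomly initialized head weights; a trained model would learn these.
w_cls = rng.standard_normal((HIDDEN_DIM, NUM_CLASSES + 1)) * 0.02
b_cls = np.zeros(NUM_CLASSES + 1)
w_box = rng.standard_normal((HIDDEN_DIM, 4)) * 0.02
b_box = np.zeros(4)

class_logits, boxes = detection_head(
    encoder_output, NUM_DET_TOKENS, w_cls, b_cls, w_box, b_box
)
print(class_logits.shape, boxes.shape)  # (100, 81) (100, 4)
```

Every detection token yields one candidate object, so the model produces a fixed number of predictions per image, and candidates assigned to the "no object" slot are discarded.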

YOLOS counts as a YOLO model because it looks at the sequence of image patches only once, making it a "You Only Look Once" model.

Aside from that fact, the YOLOS network architecture shares nothing with previous YOLO models.

YOLOS Evaluation - Should I Switch to YOLOS?

In relation to other YOLO models, the accuracy of YOLOS is not yet the best in class.

YOLOS is worth looking into for research purposes. Otherwise, we recommend staying tuned for future iterations of transformers in object detection.

YOLOS evaluation relative to comparable YOLO models (YOLOv3-Tiny, YOLOv4-Tiny)

Training YOLOS

If you want to train YOLOS on your own data, stay tuned by subscribing to the Roboflow blog, or check out the YOLOS repo source.

If you are just seeking the current best performance on your custom dataset, we recommend starting by training YOLOv5.

Happy training and as always, happy detecting!