YOLO26 Depth: Monocular Depth Estimation in Meters

YOLO26-Depth and Monocular Depth Estimation

Published May 28, 2026 • 6 min read

YOLO26 Depth brings monocular depth estimation to the YOLO model family: predicting how far away every pixel in an image is, from a single ordinary camera. Five pretrained models are available, YOLO26n-depth through YOLO26x-depth, and they output metric depth in meters.

YOLO models are a family of real-time computer vision models designed to handle a wide range of tasks, including object detection, segmentation, pose estimation, and classification.

In this guide, we cover what monocular depth estimation is, how YOLO26 Depth performs on standard benchmarks, what the metric depth output means for real applications, and how to build a working detection-plus-distance system in Roboflow Workflows today, with no GPU and nothing to install.

What Is Monocular Depth Estimation?

Monocular depth estimation is the computer vision task of predicting the distance from the camera to every point in a scene using a single RGB image. The output is a depth map: a dense grid the same size as the input image where each pixel holds a distance value instead of a color.

A single image is ambiguous, making this a hard problem to solve. For example, a toy car photographed up close and a real car photographed from far away can produce nearly identical pixels. Stereo systems solve this with two calibrated cameras and triangulation, the way human eyes do. Monocular models instead learn depth cues from millions of training images: perspective, texture gradients, object size priors, occlusion, and lighting.

There are two types of depth models:

Relative depth models predict the ordering and shape of a scene (this pixel is farther than that one) but not real-world units. They generalize well but need calibration before you can set a threshold like "alert at 2 meters".
Metric depth models predict absolute distance in physical units, typically meters. They can drive real spatial decisions directly, at the cost of being more sensitive to the camera and scene they were trained on.

For a broader grounding, see our guide to depth estimation in computer vision.

What Is YOLO26 Depth?

YOLO26 Depth extends the YOLO lineup with a dedicated depth estimation task.

Given one RGB image, a YOLO26 Depth model returns a dense float map of shape (H, W), aligned to the input resolution, where every pixel value is an estimated distance in meters from the camera to that surface point. In code, predictions come back as a tensor you can index directly, which makes fusing depth with detections a few lines of logic rather than a research project.

The models are pretrained on roughly 2.19 million images across mixed indoor and outdoor datasets, and the task supports training, validation, prediction, and export like the rest of the YOLO task family.

YOLO26 Depth Models and Benchmarks

Five model sizes are available, all evaluated at 768px on the NYU Depth V2 Eigen test split (with multi-scale and horizontal-flip test-time augmentation):

Model	delta1	abs_rel	RMSE	Params	FLOPs
YOLO26n-depth	0.882	0.109	0.414m	6.4M	46.9B
YOLO26s-depth	0.896	0.104	0.399m	13.2M	67.9B
YOLO26m-depth	0.921	0.089	0.364m	23.3M	130.7B
YOLO26l-depth	0.930	0.083	0.351m	27.7M	157.2B
YOLO26x-depth	0.933	0.080	0.344m	57.0M	302.0B

*These numbers are reported with test-time augmentation, which inflates accuracy relative to single-pass inference in production. NYU Depth V2 is also an indoor benchmark, so accuracy on a rail yard or a farm at 40 meters will differ from accuracy in a bedroom at 4 meters.

delta1 is the share of pixels whose predicted depth is within a factor of 1.25 of the true depth, that is, max(predicted/true, true/predicted) < 1.25. Higher is better; 0.933 means about 93% of pixels land within that band.
abs_rel is the mean absolute error as a fraction of true depth. Lower is better.
RMSE is root mean squared error in meters, which penalizes large misses more heavily than a plain average does. Lower is better; on NYU's indoor scenes, the largest model scores about 0.34 m.
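To make the three definitions concrete, here is a minimal sketch that computes them over a handful of paired depth values. The function name `depth_metrics` and the sample numbers are ours, and skipping non-positive pixels is an assumed convention rather than something the benchmark description states.

```python
import math

def depth_metrics(pred, truth):
    """Compute delta1, abs_rel, and RMSE over paired depth values in meters.

    Pixels with a non-positive prediction or ground truth are skipped
    (an assumption in this sketch), which also avoids division by zero.
    """
    pairs = [(p, t) for p, t in zip(pred, truth) if p > 0 and t > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)
    # delta1: fraction of pixels whose larger ratio stays under 1.25
    delta1 = sum(1 for p, t in pairs if max(p / t, t / p) < 1.25) / n
    # abs_rel: mean absolute error as a fraction of the true depth
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / n
    # RMSE: square root of the mean squared error, in meters
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    return {"delta1": delta1, "abs_rel": abs_rel, "rmse": rmse}

# Made-up values standing in for flattened depth maps
pred = [1.0, 2.2, 3.0, 5.2]
truth = [1.0, 2.0, 4.0, 4.0]
metrics = depth_metrics(pred, truth)
print(metrics)
```

Two of the four sample pixels miss the 1.25 band, so delta1 is 0.5 here, while the single 1.2 m miss dominates the RMSE.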

The nano model has 6.4M parameters, small enough to consider for edge hardware, while the x variant trades roughly 9x the parameters for about 17% lower RMSE. Pick the smallest model that hits the accuracy you need, then spend the savings on resolution or frame rate.
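The trade-off quoted above follows directly from the benchmark table. A quick check of the arithmetic, using only the nano and x rows:

```python
# Values copied from the benchmark table above
nano = {"params_m": 6.4, "rmse_m": 0.414}
xlarge = {"params_m": 57.0, "rmse_m": 0.344}

# How many times more parameters the x model has
param_ratio = xlarge["params_m"] / nano["params_m"]
# Fractional drop in RMSE going from nano to x
rmse_reduction = (nano["rmse_m"] - xlarge["rmse_m"]) / nano["rmse_m"]

print(f"parameters: {param_ratio:.1f}x")
print(f"RMSE reduction: {rmse_reduction:.1%}")
```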

Does YOLO26 Depth Output Metric or Relative Depth?

YOLO26 Depth outputs metric depth: real distances in meters, with a working range of roughly 0.02 to 150 meters. The architecture uses an exponential log-depth head, an unbounded prediction in log space that separates the shape of the scene from its absolute scale. This avoids the failure mode of bounded heads that clip when a scene falls outside the depth range they were trained on.
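The exact form of the head is not given here, so the following is only a sketch of the general idea: a generic exponential head contrasted with a generic sigmoid-bounded head. The function names and the 10 m cap are hypothetical and are not taken from the YOLO26 Depth implementation.

```python
import math

MAX_DEPTH_M = 10.0  # hypothetical training range for the bounded head

def bounded_head(raw):
    """Sigmoid-scaled head: the output can never exceed MAX_DEPTH_M."""
    if raw >= 0:
        sigmoid = 1.0 / (1.0 + math.exp(-raw))
    else:
        # Equivalent form that avoids overflow for very negative inputs
        e = math.exp(raw)
        sigmoid = e / (1.0 + e)
    return MAX_DEPTH_M * sigmoid

def log_depth_head(raw):
    """Exponential head: raw is read as log-depth, so any positive depth is reachable."""
    return math.exp(raw)

raw = math.log(40.0)
far_depth = log_depth_head(raw)                 # a 40 m scene is representable
doubled = log_depth_head(raw + math.log(2.0))   # adding a constant in log space doubles the scale
saturated = bounded_head(50.0)                  # stuck at the cap however large raw gets

print(far_depth, doubled, saturated)
```

The second line shows why log space separates shape from scale: a global scale change is a constant offset added to every log-depth value, leaving the relative structure of the scene untouched.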

Metric output means a rule like "alert when a person is within 2 meters of a forklift" can be written directly against model output, without a calibration workaround.

There is a scale calibration path when you need it: a two-parameter log-affine fit adjusts the model to a specific camera or scene without retraining. For fixed-camera installs, a one-time calibration against a known reference distance is still good practice before trusting absolute readings.
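As an illustration of what a two-parameter log-affine fit can look like, the sketch below fits log(reference) = a * log(predicted) + b by least squares and applies it to a new reading. This is one plausible reading of the term, not the documented YOLO26 Depth procedure; the function names and the sample measurements are hypothetical.

```python
import math

def fit_log_affine(pred_m, ref_m):
    """Least-squares fit of log(ref) = a * log(pred) + b; returns (a, b)."""
    xs = [math.log(p) for p in pred_m]
    ys = [math.log(r) for r in ref_m]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two reference points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("reference points must span more than one depth")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    b = mean_y - a * mean_x
    return a, b

def apply_log_affine(depth_m, a, b):
    """Map a raw prediction in meters to a calibrated distance in meters."""
    return math.exp(a * math.log(depth_m) + b)

# Hypothetical readings: model predictions vs tape-measured distances
predicted = [1.1, 2.2, 4.4, 8.8]
measured = [1.0, 2.0, 4.0, 8.0]

a, b = fit_log_affine(predicted, measured)
corrected = apply_log_affine(5.5, a, b)
print(a, b, corrected)
```

In this example the model reads 10% long at every distance, so the fit recovers a slope of 1 and a pure scale offset, and a raw 5.5 m reading is corrected to about 5.0 m.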

Fine-Tuning, Datasets, and Export

YOLO26 Depth supports the full task lifecycle:

Training and fine-tuning on custom depth data, with paired RGB images and float32 depth files. The recommended recipe is a low learning rate with AdamW, mixing in a small share of general-purpose images to reduce forgetting.
Validation reporting delta1/2/3, abs_rel, RMSE, and silog metrics.
Built-in dataset configurations for NYU Depth V2, KITTI, Hypersim, SUN RGB-D, ARKitScenes, and Depth8.
Export to 17 formats, including ONNX, TensorRT, CoreML, OpenVINO, and NCNN, covering most edge and cloud targets.

Full technical documentation lives in the depth task docs.

What About Licensing?

The depth documentation does not state a license for the pretrained models, and prior YOLO model releases shipped under AGPL-3.0, which requires open-sourcing derivative software or purchasing a commercial license. If you are deploying commercially, confirm the license terms before building on the weights. For teams that need commercial-safe licensing without an enterprise sales call, see how Roboflow licenses Ultralytics, and note that RF-DETR and the Roboflow model lineup are commercial-safe by design.

What Can You Build With Monocular Depth Estimation?

Depth turns detections into spatial decisions. Because the output is in meters, each of these can be built with thresholds in real units:

Forklift proximity alerts: detect people and forklifts in the same frame and alert when the distance between them drops below a set threshold, using cameras already mounted in the warehouse.
Person-to-person spacing: measure crowd spacing in retail, transit, and event spaces without floor sensors or stereo rigs.
Robotics grasping: a robot arm needs the distance to the part, not just its bounding box. Monocular depth gives grasp planning a third coordinate from one camera.
Object size estimation: with distance known, pixel dimensions convert to physical dimensions, useful for sizing packages, produce, or parts in motion.
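For the size estimation case, the conversion is the standard pinhole-camera relation: physical size equals pixel size times distance divided by focal length in pixels. The sketch below assumes the focal length is known from camera calibration and that the object faces the camera roughly head-on; the helper name and all numbers are hypothetical.

```python
def pixels_to_meters(size_px, depth_m, focal_length_px):
    """Pinhole-camera conversion from an extent in pixels to an extent in meters."""
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    return size_px * depth_m / focal_length_px

# Hypothetical detection: a 200 x 150 px box around a package 3 m away,
# seen by a camera with a 1200 px focal length
box_width_px, box_height_px = 200, 150
depth_m = 3.0
focal_px = 1200.0

width_m = pixels_to_meters(box_width_px, depth_m, focal_px)
height_m = pixels_to_meters(box_height_px, depth_m, focal_px)
print(f"{width_m:.3f} m x {height_m:.3f} m")
```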

How to Detect Objects and Estimate Distance From One Camera

The capability YOLO26 Depth promises, detection plus depth from a single camera, is something you can build in Roboflow Workflows right now by fusing a detector with the Depth Estimation block, which runs Depth Anything 3 (in Workflows since January).

The pattern:

Add an Object Detection block to a new Workflow. We recommend RF-DETR, trained on your own classes (person, forklift, part) or a pre-trained checkpoint.
Add the Depth Estimation block on the same input image. It returns a per-pixel depth map alongside your detections.
Fuse the two with a custom logic block: sample the depth map at the center of each bounding box to get a depth value per detected object.
Convert to real-world units with a one-time calibration: record the depth value of an object at a known distance, then scale new readings against that reference.
Deploy with Inference on a video stream, at the edge, on-prem, or in the cloud.
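The fusion and calibration steps above can be sketched in plain Python. This is not the Workflows custom block API: the box format, the helper name `depth_at_box_center`, and the toy depth map are assumptions, and a nested list stands in for the real depth array to keep the example dependency-free.

```python
from statistics import median

def depth_at_box_center(depth, box, radius=1):
    """Median depth in a small patch around the center of a bounding box.

    box is (x_min, y_min, x_max, y_max) in pixels. Taking the median over a
    patch is a design choice in this sketch to damp single-pixel noise.
    """
    height, width = len(depth), len(depth[0])
    # Clamp the center so a box touching the border still yields a valid patch
    cx = min(max(int((box[0] + box[2]) / 2), 0), width - 1)
    cy = min(max(int((box[1] + box[3]) / 2), 0), height - 1)
    patch = [
        depth[y][x]
        for y in range(max(0, cy - radius), min(height, cy + radius + 1))
        for x in range(max(0, cx - radius), min(width, cx + radius + 1))
    ]
    return median(patch)

# Toy 6 x 6 depth map in uncalibrated units
depth_map = [[8.0] * 6 for _ in range(6)]
for y in range(0, 3):
    for x in range(0, 3):
        depth_map[y][x] = 2.0  # reference object
for y in range(3, 6):
    for x in range(3, 6):
        depth_map[y][x] = 5.0  # newly detected object

reference_box = (0, 0, 2, 2)
detection_box = (3, 3, 5, 5)

# One-time calibration: the reference object was measured at 1.5 m
reference_reading = depth_at_box_center(depth_map, reference_box)
scale = 1.5 / reference_reading

# Every later reading is scaled against that reference
distance_m = scale * depth_at_box_center(depth_map, detection_box)
print(distance_m)
```

A single scale factor assumes the depth values are proportional to true distance; if readings drift with range, a fit over several reference distances is a safer choice.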

From there, the use cases are logic on top of the same Workflow. A forklift proximity alert compares the depth and image positions of person and forklift detections and fires a notification when they converge. Distance-between-people monitoring applies the same pairwise comparison across person detections. For grasping, the depth at the target's center gives the approach distance, and with distance known, bounding box dimensions convert to physical object size.
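One way to implement the pairwise comparison is to back-project each detection into 3D camera coordinates and measure the straight-line distance between them. The sketch below assumes a pinhole camera with known intrinsics and treats the depth value as distance along the optical axis; the detection format (u, v, depth in meters) and the function names are hypothetical.

```python
import math
from itertools import product

def back_project(u, v, depth_m, focal_px, cx, cy):
    """Convert a pixel plus its depth into (x, y, z) camera coordinates in meters."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)

def proximity_alerts(people, forklifts, threshold_m, focal_px, cx, cy):
    """Return (person, forklift, distance) for every pair closer than threshold_m."""
    alerts = []
    for person, forklift in product(people, forklifts):
        p = back_project(*person, focal_px, cx, cy)
        f = back_project(*forklift, focal_px, cx, cy)
        distance = math.dist(p, f)
        if distance < threshold_m:
            alerts.append((person, forklift, distance))
    return alerts

# Hypothetical detections as (u, v, depth_m), sampled at each box center
people = [(640, 360, 5.0)]
forklifts = [(940, 360, 5.0), (640, 360, 12.0)]

alerts = proximity_alerts(
    people, forklifts, threshold_m=2.0, focal_px=1000.0, cx=640.0, cy=360.0
)
for person, forklift, distance in alerts:
    print(f"person {person} is {distance:.2f} m from forklift {forklift}")
```

Only the first forklift triggers an alert: it sits 1.5 m to the side at the same depth, while the second overlaps the person in the image but is 7 m farther away. Passing the same list of people twice, and skipping identical pairs, gives the person-to-person spacing check.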

RF-DETR is faster and more accurate than YOLO26 for object detection, and it ships with commercial-safe licensing. As the detector in a detection-plus-depth Workflow, better boxes mean better depth samples: the depth value is only as good as the box you sample it from.

YOLO26 Depth Alternatives

See our depth estimation model roundup for the full landscape.

Depth Anything 3

Depth Anything 3 is the current standard for monocular depth estimation and runs today in the Roboflow Workflows Depth Estimation block. It generalizes well to real-world scenes without camera-specific calibration. When YOLO26 Depth ships, this is the model its benchmarks will be measured against.

RF-DETR with the Depth Estimation Block

This is the fusion pattern from the tutorial above, and it is available now. RF-DETR handles real-time detection, Depth Anything 3 handles depth, and a Workflow combines them into per-object distances. Because the pieces are separate, you can swap either model as better ones ship, without rebuilding the pipeline.

Conclusion

Depth estimation is most useful when it is wired into a system that acts on it: an alert, a measurement, a robot command. You can build that today, in a browser, with a free Roboflow account.

Cite this Post

Use the following entry to cite this post in your research:

Erik Kokalj. (May 28, 2026). What Is YOLO26-Depth? Monocular Depth Estimation. Roboflow Blog: https://blog.roboflow.com/what-is-yolo-depth/

Written by

Erik Kokalj

Developer Experience @ Roboflow
