Roboflow YOLO-Depth
Published May 28, 2026 • 4 min read
SUMMARY

YOLO-Depth is an announced monocular depth estimation model in the YOLO family, predicting per-pixel distance from a single camera, expected September 2026.

YOLO models are a family of real-time computer vision models designed to handle a wide range of tasks, including object detection, segmentation, pose estimation, and classification.

YOLO-Depth has been announced as part of YOLO27. It is described as monocular depth estimation from a single camera, and its release is planned for September 2026.

In this blog, we'll cover what YOLO-Depth is, what it could be used for, what remains unknown ahead of release, and a tutorial for combining object detection with depth estimation in a single pipeline today.

What Is YOLO-Depth?

YOLO-Depth is an announced monocular depth estimation model in the YOLO family, unveiled as part of the YOLO27 generation. Monocular depth estimation takes a standard 2D image from a single camera and produces a depth map, an image where every pixel value corresponds to distance from the camera.

Every prior YOLO task answered questions on the image plane: what is in the frame and where. Depth adds the third dimension, how far away each thing is, from the same camera feed.

The appeal is hardware cost. A single RGB camera is the cheapest, most widely deployed sensor in the world. If a model can extract usable depth from cameras already installed on a production line, in a warehouse, or on a vehicle, teams get 3D understanding without new sensors or new capex.

The historical tradeoff is that monocular depth is relative rather than absolute. Models like Depth Anything 3 predict which pixels are closer and which are farther with impressive consistency, but converting that to real-world units (meters, centimeters) requires calibration against a known reference. Whether YOLO-Depth ships metric depth out of the box or relative depth like its peers is one of the open questions below.

What Could YOLO-Depth Be Used For?

Depth turns detections into spatial decisions. The use cases that benefit most are the ones where what matters is not just that an object is present, but how far away it is:

  • Forklift proximity alerts: detect people and forklifts in the same frame and alert when the distance between them drops below a threshold, from cameras already mounted in the warehouse
  • Social distancing and crowd spacing: measure person-to-person distance in retail, transit, and event spaces without floor sensors or stereo rigs
  • Robotics grasping: a robot arm needs the distance to the part, not just its bounding box. Monocular depth gives grasp planning a third coordinate from one camera.
  • Object size estimation: with distance and the camera's focal length known, pixel dimensions convert to physical dimensions, useful for sizing packages, produce, or parts in motion
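
As a rough illustration of the object size use case, the sketch below applies the textbook pinhole-camera relation: physical size equals pixel extent times distance, divided by focal length in pixels. It assumes the distance is already in meters, the focal length comes from a camera calibration, and the object faces the camera roughly head-on. The function name and the numbers are hypothetical and are not part of any YOLO-Depth or Roboflow API.

```python
def physical_size_m(pixel_extent: float, distance_m: float, focal_length_px: float) -> float:
    """Pinhole-camera estimate of an object's physical extent in meters.

    pixel_extent: width or height of the bounding box, in pixels.
    distance_m: distance from the camera to the object, in meters (assumed metric).
    focal_length_px: focal length in pixels (assumed known from calibration).
    """
    if focal_length_px <= 0:
        raise ValueError("focal_length_px must be positive")
    return pixel_extent * distance_m / focal_length_px


# Hypothetical numbers: a box 200 px wide, 3 m away, seen with a 1000 px focal length.
width_m = physical_size_m(200, 3.0, 1000.0)
print(f"estimated width: {width_m:.2f} m")
```

The estimate degrades when the object is tilted relative to the image plane or when the bounding box is loose, so treat it as an approximation.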

What We Don't Know Yet About YOLO-Depth

As of this writing, we have not seen:

  • Benchmarks: no accuracy or latency numbers, and no comparison against existing depth models like Depth Anything 3
  • Relative or metric depth: whether YOLO-Depth outputs calibrated real-world distances or relative depth requiring a calibration step, which determines how much engineering the use cases above actually need
  • Model sizes: no confirmation of the Nano through Extra Large variant lineup used in previous generations
  • Edge performance: depth models are heavier than detectors, and whether YOLO-Depth runs at camera frame rates on edge hardware is unstated
  • Licensing: YOLO-Depth licensing terms have not been announced. Previous similar releases shipped under AGPL-3.0, which requires open-sourcing derivative works unless you purchase a commercial license. If you are evaluating models for commercial deployment, this is worth confirming before you build on it.
  • A paper: no plans for a formal research paper have been indicated

How to Detect Objects and Estimate Distance From One Camera Today

The capability YOLO-Depth promises, detection plus depth from a single camera, is something you can build in Roboflow Workflows right now by fusing a detector with the Depth Estimation block, which runs Depth Anything 3 (in Workflows since January).

The pattern:

  1. Add an Object Detection block to a new Workflow. We recommend RF-DETR, trained on your own classes (person, forklift, part) or a pre-trained checkpoint.
  2. Add the Depth Estimation block on the same input image. It returns a per-pixel depth map alongside your detections.
  3. Fuse the two with a custom logic block: sample the depth map at the center of each bounding box to get a depth value per detected object.
  4. Convert to real-world units with a one-time calibration: record the depth value of an object at a known distance, then scale new readings against that reference.
  5. Deploy with Inference on a video stream, at the edge, on-prem, or in the cloud.
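
Steps 3 and 4 can be sketched in plain Python as follows. This is a minimal illustration, not the Workflows custom block API: the function names, the box format, and the toy depth map are all hypothetical. It also assumes that depth values grow with distance and scale linearly, so a single reference measurement is enough. If the depth model outputs inverse depth or disparity, the conversion would need to change.

```python
from statistics import median
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def depth_at_box_center(depth_map: Sequence[Sequence[float]], box: Box, radius: int = 1) -> float:
    """Median depth in a small window around the box center.

    Using a window instead of one pixel makes the sample less sensitive to noise.
    """
    height, width = len(depth_map), len(depth_map[0])
    # Box center, clamped so it always lands inside the image.
    cx = min(max(int((box[0] + box[2]) / 2), 0), width - 1)
    cy = min(max(int((box[1] + box[3]) / 2), 0), height - 1)
    # Window around the center, clipped to the image bounds.
    rows = range(max(cy - radius, 0), min(cy + radius + 1, height))
    cols = range(max(cx - radius, 0), min(cx + radius + 1, width))
    return median(depth_map[r][c] for r in rows for c in cols)


def scale_from_reference(reference_reading: float, known_distance_m: float) -> float:
    """One-time calibration: meters per depth unit, from one object at a known distance."""
    if reference_reading <= 0:
        raise ValueError("reference_reading must be positive")
    return known_distance_m / reference_reading


# Toy 4x4 depth map in arbitrary relative units (assumption: larger means farther).
depth_map = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 3.0, 4.0, 4.0],
    [3.0, 3.0, 4.0, 4.0],
]

# Calibration: an object that reads 2.0 was measured at 5 m from the camera.
meters_per_unit = scale_from_reference(2.0, 5.0)

# New detection: sample its depth and convert to meters.
reading = depth_at_box_center(depth_map, (2, 2, 4, 4), radius=0)
distance_m = reading * meters_per_unit
print(f"reading={reading}, distance={distance_m} m")
```

Sampling the center works best when the object fills the middle of its box. For thin or hollow objects, sampling inside a segmentation mask is likely to be more reliable.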

From there, the use cases are logic on top of the same Workflow. A forklift proximity alert compares the depth and image positions of person and forklift detections and fires a notification when they converge. Distance-between-people monitoring applies the same pairwise comparison across person detections. For grasping, the depth at the target's center gives the approach distance, and with distance known, bounding box dimensions convert to physical object size.
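
One way to express the proximity comparison is to back-project each detection's box center into camera coordinates with the pinhole model, then measure the straight-line gap between every person and forklift pair. The sketch below assumes calibrated metric depth, a known focal length, and a known principal point. The `Detection` class, function names, and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass
class Detection:
    label: str
    center_px: Tuple[float, float]  # bounding box center (u, v) in pixels
    distance_m: float               # calibrated depth along the camera axis, in meters


def to_camera_xyz(det: Detection, focal_px: float, principal_point: Tuple[float, float]):
    """Back-project a box center to 3D camera coordinates with the pinhole model."""
    u, v = det.center_px
    cx, cy = principal_point
    z = det.distance_m
    return ((u - cx) * z / focal_px, (v - cy) * z / focal_px, z)


def proximity_alerts(
    detections: List[Detection],
    focal_px: float,
    principal_point: Tuple[float, float],
    threshold_m: float,
) -> List[Tuple[Detection, Detection, float]]:
    """Return (person, forklift, gap_m) for every pair closer than threshold_m."""
    people = [d for d in detections if d.label == "person"]
    forklifts = [d for d in detections if d.label == "forklift"]
    alerts = []
    for person, forklift in product(people, forklifts):
        gap = math.dist(
            to_camera_xyz(person, focal_px, principal_point),
            to_camera_xyz(forklift, focal_px, principal_point),
        )
        if gap < threshold_m:
            alerts.append((person, forklift, gap))
    return alerts


# Hypothetical frame: one person, one nearby forklift, one distant forklift.
detections = [
    Detection("person", (640.0, 360.0), 5.0),
    Detection("forklift", (840.0, 360.0), 5.0),
    Detection("forklift", (640.0, 360.0), 10.0),
]
alerts = proximity_alerts(detections, focal_px=1000.0, principal_point=(640.0, 360.0), threshold_m=2.0)
for person, forklift, gap in alerts:
    print(f"ALERT: person within {gap:.2f} m of forklift")
```

Distance-between-people monitoring would follow the same shape, with pairs drawn from the person detections alone.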

RF-DETR is faster and more accurate than YOLO26 for object detection, and it ships with commercial-safe licensing. As the detector in a detection-plus-depth Workflow, better boxes mean better depth samples: the depth value is only as good as the box you sample it from.

YOLO-Depth Alternatives

While YOLO-Depth is not yet available, monocular depth estimation is a mature space. See our depth estimation model roundup for the full landscape.

Depth Anything 3

Depth Anything 3 is the current standard for monocular depth estimation and runs today in the Roboflow Workflows Depth Estimation block. It generalizes well to real-world scenes without camera-specific calibration. When YOLO-Depth ships, this is the model its benchmarks will be measured against.

RF-DETR with the Depth Estimation Block

This is the fusion pattern from the tutorial above, and it is available now. RF-DETR handles real-time detection, Depth Anything 3 handles depth, and a Workflow combines them into per-object distances. Because the pieces are separate, you can swap either model as better ones ship, without rebuilding the pipeline.

YOLO-StereoDepth

Announced alongside YOLO-Depth in the YOLO27 generation, YOLO-StereoDepth computes absolute depth from two cameras using binocular disparity. If your deployment can mount a calibrated stereo pair and needs metric distances without a calibration workaround, it is the sibling model to watch, though it is also not expected to be released until September 2026.
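
For context on why a stereo pair yields metric depth, the textbook relation for a rectified stereo rig is depth = focal length × baseline / disparity. The sketch below shows only that general formula. How YOLO-StereoDepth works internally has not been published, and the function name and numbers here are hypothetical.

```python
def depth_from_disparity(disparity_px: float, focal_length_px: float, baseline_m: float) -> float:
    """Depth in meters for a rectified stereo pair: Z = f * B / d.

    disparity_px: horizontal pixel shift of a point between the left and right images.
    focal_length_px: focal length in pixels (assumed equal for both cameras).
    baseline_m: distance between the two camera centers, in meters.
    """
    if disparity_px <= 0:
        raise ValueError("disparity_px must be positive")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 1000 px focal length, 12 cm baseline, 40 px disparity.
depth_m = depth_from_disparity(40.0, 1000.0, 0.12)
print(f"depth: {depth_m:.2f} m")
```

Because the baseline is a measured physical length, the result comes out in meters directly, with no reference object needed.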

YOLO-Depth Conclusion

YOLO-Depth points at the same shift the rest of YOLO27 does: from understanding what is in a frame to understanding the physical space around a camera.

Interested in the latest tech? Starting today, you can use Claude Fable 5, Anthropic's most capable model, in Roboflow. You can add it to a production pipeline in minutes, no API wiring required.

Cite this Post

Use the following entry to cite this post in your research:

Contributing Writer. (May 28, 2026). What Is YOLO-Depth?. Roboflow Blog: https://blog.roboflow.com/what-is-yolo-depth/

Written by

Contributing Writer