What Is Depth Estimation in Computer Vision?

Published Apr 3, 2026 • 4 min read

Depth estimation is the computer vision task of predicting how far away everything in an image is from the camera, turning 2D pixels into 3D understanding from standard cameras.

Depth estimation is the computer vision task of determining the distance between the camera and the objects in a scene. Where object detection answers what is in an image and where it sits on the image plane, depth estimation answers the third question: how far away each thing is.

The output is a depth map, an image where every pixel value corresponds to distance from the camera. Combined with detection or segmentation, depth turns a flat camera feed into spatial understanding: not just that a person and a forklift share an aisle, but that they are 2.1 meters apart.

In this post, we'll cover how depth estimation works, the difference between relative and metric depth, the main approaches and how to choose between them, what depth estimation is used for, and how to add it to a vision pipeline today.

What Is a Depth Map?

A depth map is the standard output of depth estimation: a 2D image where each pixel encodes distance rather than color. Visualized, nearby surfaces render in one tone and far surfaces in another, producing the familiar gradient image where a scene's geometry becomes visible at a glance.
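To make the "gradient image" concrete, here is a minimal sketch of how a depth map can be turned into a grayscale picture. The tiny 3x3 grid and the near-is-bright convention are illustrative assumptions; real depth maps are full-resolution arrays, and tools differ on which end of the range is rendered bright.

```python
def depth_to_grayscale(depth_map):
    """Map a 2D grid of depth values to 0-255, with near = bright and far = dark."""
    flat = [d for row in depth_map for d in row]
    near, far = min(flat), max(flat)
    span = far - near
    if span == 0:
        # A perfectly flat scene has no gradient to show.
        return [[255 for _ in row] for row in depth_map]
    return [[round(255 * (far - d) / span) for d in row] for row in depth_map]


# Hypothetical 3x3 depth map: a near object (1.0) in front of a far wall (4.0).
depth_map = [
    [4.0, 4.0, 4.0],
    [4.0, 1.0, 4.0],
    [2.0, 1.0, 2.0],
]
gray = depth_to_grayscale(depth_map)
for row in gray:
    print(row)
```

The nearest pixels come out at 255 and the farthest at 0, which is all a depth visualization really is: the same grid of distances, rescaled for display.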

Depth maps come in two forms, and the difference matters more than any other concept in this topic:

Relative depth orders the scene: this pixel is closer than that one. The values are consistent within the image but carry no real-world units.
Metric (absolute) depth gives distances in real units: this pallet is 3.4 meters from the camera.

Relative depth is enough when you need ranking and structure: which surfaces are near, and what is in front of what. Metric depth is required when a machine acts on the number: stopping distances, grasp targets, measurements. Converting relative depth to metric requires calibration against a known reference. That is a one-time step, but it is real engineering work.
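One common way to do that calibration is a least-squares fit of a scale and an offset between the model's output and a few hand-measured distances. The sketch below assumes the relative values are linearly related to true distance, which does not hold for every model (some predict inverse depth, in which case you would fit against 1 / distance instead). The function name and the reference numbers are hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of: metric ~= scale * relative + shift."""
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two matched reference points")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("reference points must sit at different depths")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Hypothetical references: model output at three spots vs tape-measured meters.
relative_refs = [0.2, 0.5, 0.8]
metric_refs = [1.5, 3.0, 4.5]
scale, shift = fit_scale_shift(relative_refs, metric_refs)


def to_meters(relative_value):
    """Apply the fitted calibration to a new relative depth value."""
    return scale * relative_value + shift


print(round(to_meters(0.6), 3))
```

More reference points spread across the working range make the fit more trustworthy, and the calibration only stays valid while the camera and model stay the same.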

How Depth Estimation Works

There are three main routes to depth, two of them camera-based and one using active sensors.

Monocular Depth Estimation

Monocular depth estimation predicts depth from a single ordinary camera. There is no geometric trick here: a neural network learns the visual cues humans use (relative size, perspective, occlusion, texture density) from millions of training images and applies those priors to new scenes. Models like Depth Anything 3 generalize well to scenes they have never seen, with no special hardware and no calibration.

The tradeoff is that monocular depth is relative by default. The model knows the forklift is closer than the racking, but converting that to meters requires a calibration reference.

Stereo Depth Estimation

Stereo depth uses binocular disparity, the same principle as human vision. Two cameras a known distance apart (the baseline) capture the same scene, and depth is computed from the disparity: how far each point shifts between the two views. Because the baseline and the cameras' focal length are known from calibration, stereo produces metric depth out of the box.
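For a rectified stereo pair, the relationship is the standard triangulation formula: depth = focal length x baseline / disparity. Here is a minimal sketch; the focal length, baseline, and disparity values are made-up illustrative numbers, not the specification of any particular camera.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity means the point is effectively at infinity (or unmatched).
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 800 px focal length, 10 cm baseline.
FOCAL_LENGTH_PX = 800.0
BASELINE_M = 0.10

for disparity in (40.0, 20.0, 8.0):
    depth = disparity_to_depth(disparity, FOCAL_LENGTH_PX, BASELINE_M)
    print(f"disparity {disparity:5.1f} px -> depth {depth:.2f} m")
```

Large disparities mean near objects and small disparities mean far ones, and the result is in meters only because the baseline is.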

Stereo depth cameras like the Luxonis OAK-D, Stereolabs ZED, and Intel RealSense compute depth onboard and output an aligned depth frame alongside the RGB image. The weak spots are textureless surfaces (blank walls, glass), where matching between the two views fails, and long range, where accuracy degrades as disparity shrinks.
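The long-range weakness follows directly from the triangulation formula Z = f * B / d. Differentiating with respect to disparity gives an approximate depth error of Z^2 * (disparity error) / (f * B), so error grows with the square of distance. The sketch below puts numbers on that; the rig parameters and the quarter-pixel matching error are assumptions chosen for illustration.

```python
def depth_error_m(depth_m, focal_length_px, baseline_m, disparity_error_px=0.25):
    """First-order stereo depth uncertainty: dZ ~= Z^2 * dd / (f * B)."""
    return depth_m ** 2 * disparity_error_px / (focal_length_px * baseline_m)


# Hypothetical rig: 800 px focal length, 10 cm baseline, 0.25 px matching error.
FOCAL_LENGTH_PX = 800.0
BASELINE_M = 0.10

for depth in (1.0, 5.0, 10.0):
    err = depth_error_m(depth, FOCAL_LENGTH_PX, BASELINE_M)
    print(f"at {depth:4.1f} m: about +/- {err * 100:.1f} cm")
```

Under these assumptions, going from 1 m to 10 m multiplies the error by 100, which is why a wider baseline or a longer focal length is the usual answer when a stereo rig needs more range.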

Active Sensors: Lidar, Time-of-Flight, Structured Light

Active sensors measure depth directly by emitting light and timing or analyzing its return. Lidar offers long range and works in darkness, which is why it dominates autonomous driving. The costs are a higher price, no color or texture information, and another sensor to integrate. For most industrial applications working at room-to-warehouse distances, camera-based depth has become the practical default, which is why upcoming models like YOLO-StereoDepth are positioned as camera-native alternatives to lidar.
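For the timing-based sensors, the arithmetic is simple: light travels to the surface and back, so distance is the speed of light times the round-trip time, divided by two. A minimal sketch, with the 20-nanosecond example value chosen purely for illustration:

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_distance_m(round_trip_s):
    """Distance from a time-of-flight measurement: d = c * t / 2."""
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


# A pulse that returns after 20 nanoseconds hit something about 3 m away.
print(f"{tof_distance_m(20e-9):.3f} m")
```

The numbers show why these sensors need fast electronics: one nanosecond of timing error corresponds to roughly 15 cm of distance.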

Choosing an Approach

	Monocular	Stereo	Lidar / active
Depth type	Relative; metric with calibration	Metric from the known baseline	Metric, measured directly
Hardware	Any existing RGB camera	Stereo camera, a few hundred dollars	Dedicated sensor, significantly more expensive
Strengths	Zero new capex, works on installed cameras	Real units at short-to-mid range, color and depth from one device	Long range, darkness, weather
Weak spots	Absolute scale, unusual scenes	Textureless surfaces, long range	Cost, no color, integration
Best for	Retrofits, monitoring, proximity ranking	Robotics, grasping, dimensioning	Autonomous vehicles, long-range mapping

The short version: monocular depth if the cameras are already installed and ranking is enough, stereo if a machine acts on real units, lidar where range and darkness rule cameras out.

What Is Depth Estimation Used For?

Depth turns detections into spatial decisions:

Safety and proximity: forklift proximity alerts that fire when a person and a vehicle close below a threshold, from cameras already mounted in the warehouse
Robotics: navigation, obstacle stopping distances, and grasping, where an arm needs the approach distance to a part, not just its bounding box
Measurement: with distance known, pixel dimensions convert to physical dimensions for sizing packages, produce, and parts in motion
Scene understanding: separating foreground from background for effects, occlusion handling in AR, and smarter region logic in inspection pipelines
Crowd and space analytics: person-to-person distances in retail, transit, and event spaces without floor sensors
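The measurement case is a good example of how little math sits between a depth value and a useful number. Under a pinhole camera model, an object that spans a given number of pixels at a known distance has a physical size of pixels x distance / focal length. The sketch below assumes the object faces the camera roughly head-on and that lens distortion has been corrected; the function name and numbers are hypothetical.

```python
def pixel_span_to_meters(span_px, depth_m, focal_length_px):
    """Physical size of a span seen at a given depth (pinhole camera model)."""
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    return span_px * depth_m / focal_length_px


# Hypothetical package: 200 px wide, 2.0 m from a camera with an 800 px focal length.
width_m = pixel_span_to_meters(200.0, 2.0, 800.0)
print(f"package width: {width_m:.2f} m")
```

The same 200-pixel box at 4 m would be twice as wide, which is exactly the ambiguity a detector alone cannot resolve.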

How to Use Depth Estimation Today

Depth estimation is available in Roboflow Workflows as the Depth Estimation block, which runs Depth Anything 3. The most useful pattern fuses it with detection:

Detect the objects you care about. We recommend RF-DETR, which ships with commercial-safe licensing.
Run the Depth Estimation block on the same frame to get a per-pixel depth map.
Sample the depth map at each bounding box to attach a depth value to every detection.
Calibrate once against a known distance if you need real-world units, then build your logic: alerts, measurements, comparisons.
Deploy with Inference on video streams, in the cloud, on-prem, or at the edge.
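The sampling and logic steps can be sketched in plain Python. This is not the Workflows API or its output format: the depth map is a nested list, boxes are (x_min, y_min, x_max, y_max) tuples, and the camera intrinsics (fx, fy, cx, cy) are toy values, all assumed for illustration. It also assumes the depth values are already in meters.

```python
import math
from statistics import median


def sample_box_depth(depth_map, box):
    """Median depth inside a box given as (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    values = [
        depth_map[y][x]
        for y in range(y_min, y_max)
        for x in range(x_min, x_max)
    ]
    if not values:
        raise ValueError("box covers no pixels")
    return median(values)


def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth to camera XYZ."""
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)


def detection_position(depth_map, box, fx, fy, cx, cy):
    """3D position of a detection: box center back-projected at the box's depth."""
    x_min, y_min, x_max, y_max = box
    u, v = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    return back_project(u, v, sample_box_depth(depth_map, box), fx, fy, cx, cy)


# Toy 4x8 metric depth map: left half at 2 m, right half at 4 m, one noisy pixel.
depth_map = [[2.0] * 4 + [4.0] * 4 for _ in range(4)]
depth_map[0][0] = 9.0

person_box = (0, 0, 4, 4)     # hypothetical detector output
forklift_box = (4, 0, 8, 4)   # hypothetical detector output
intrinsics = dict(fx=4.0, fy=4.0, cx=4.0, cy=2.0)  # toy values

person_xyz = detection_position(depth_map, person_box, **intrinsics)
forklift_xyz = detection_position(depth_map, forklift_box, **intrinsics)
separation_m = math.dist(person_xyz, forklift_xyz)

ALERT_THRESHOLD_M = 3.0  # hypothetical safety distance
alert = separation_m < ALERT_THRESHOLD_M
print(f"separation: {separation_m:.2f} m, alert: {alert}")
```

The median is used rather than the mean because a bounding box usually includes some background pixels, and a single far-away value should not drag the object's depth with it. Sampling only the central region of the box, or using a segmentation mask, are common refinements.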

If your application needs metric depth without the calibration step, a stereo depth camera supplies it onboard, and the same Workflow pattern applies: the camera provides the depth frame, the detector provides the objects, and the logic joins them.

Depth Estimation Models

The model landscape is moving quickly. Depth Anything 3 is the current standard for monocular depth and runs in Roboflow Workflows today. The YOLO27 generation adds two depth models in September 2026: YOLO-Depth for monocular depth and YOLO-StereoDepth for stereo. For a broader comparison, see the depth estimation model roundup.

Depth Estimation Conclusion

Depth estimation gives software the third dimension: the distance from the camera to everything it sees. With strong monocular models running on ordinary cameras and stereo hardware supplying metric depth for a few hundred dollars, 3D understanding no longer requires lidar budgets or research teams.

Cite this Post

Use the following entry to cite this post in your research:

Contributing Writer. (Apr 3, 2026). What Is Depth Estimation in Computer Vision? Roboflow Blog: https://blog.roboflow.com/depth-estimation-in-computer-vision/
