Depth estimation is the computer vision task of determining the distance between the camera and the objects in a scene. Where object detection answers what is in an image and where it sits on the image plane, depth estimation answers the third question: how far away each thing is.
The output is a depth map, an image where every pixel value corresponds to distance from the camera. Combined with detection or segmentation, depth turns a flat camera feed into spatial understanding: not just that a person and a forklift share an aisle, but that they are 2.1 meters apart.
In this post, we'll cover how depth estimation works, the difference between relative and metric depth, the main approaches and how to choose between them, what depth estimation is used for, and how to add it to a vision pipeline today.
What Is a Depth Map?
A depth map is the standard output of depth estimation: a 2D image where each pixel encodes distance rather than color. Visualized, nearby surfaces render in one tone and far surfaces in another, producing the familiar gradient image where a scene's geometry becomes visible at a glance.
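As a rough illustration of that gradient image, here is a minimal sketch that rescales a depth array to 8-bit grayscale. It assumes larger values mean farther away (some models output inverse depth instead, in which case the mapping flips), and the function name `depth_to_grayscale` is ours, not part of any library.

```python
import numpy as np

def depth_to_grayscale(depth: np.ndarray) -> np.ndarray:
    """Scale a float depth map to uint8 so near pixels are bright and far pixels are dark.

    Assumption: larger values mean farther from the camera.
    """
    depth = depth.astype(np.float64)
    valid = np.isfinite(depth)
    if not valid.any():
        return np.zeros(depth.shape, dtype=np.uint8)
    lo, hi = depth[valid].min(), depth[valid].max()
    span = hi - lo
    if span == 0:
        # A perfectly flat scene has no gradient to show.
        return np.zeros(depth.shape, dtype=np.uint8)
    normalized = (depth - lo) / span      # 0 = nearest, 1 = farthest
    normalized[~valid] = 1.0              # treat missing pixels as farthest
    return np.round((1.0 - normalized) * 255).astype(np.uint8)

# A toy 2x3 "depth map" in arbitrary units: left column near, right column far.
toy_depth = np.array([[1.0, 2.0, 3.0],
                      [1.0, 2.0, 3.0]])
gray = depth_to_grayscale(toy_depth)
```

The per-image min/max normalization is what makes this a visualization rather than a measurement: the same gray level means different distances in different frames.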
Depth maps come in two forms, and the difference matters more than any other concept in this topic:
- Relative depth orders the scene: this pixel is closer than that one. The values are consistent within the image but carry no real-world units.
- Metric (absolute) depth gives distances in real units: this pallet is 3.4 meters from the camera.
Relative depth is enough when you need ranking and structure: which surfaces are near, and what is in front of what. Metric depth is required when a machine acts on the number: stopping distances, grasp targets, measurements. Converting relative depth to metric requires calibration against a known reference. That is a one-time step, but it is real engineering.
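One common way to do that calibration is to fit a scale and a shift that map the model's values onto a few distances you have measured by hand. The sketch below assumes the relationship is affine in the model's output space, which is an assumption, not a guarantee: many relative models predict inverse depth, in which case you would fit against `1 / metric` instead. The function names and the sample numbers are hypothetical.

```python
import numpy as np

def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift.

    relative: model outputs sampled at reference points (unitless)
    metric:   hand-measured distances at the same points (meters)
    """
    relative = np.asarray(relative, dtype=np.float64)
    metric = np.asarray(metric, dtype=np.float64)
    if relative.size < 2 or relative.size != metric.size:
        raise ValueError("need at least two matched reference points")
    # Design matrix [relative, 1] so the solution is (scale, shift).
    A = np.stack([relative, np.ones_like(relative)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return float(scale), float(shift)

def to_metric(relative_map, scale, shift):
    """Apply the fitted calibration to a whole relative depth map."""
    return scale * np.asarray(relative_map, dtype=np.float64) + shift

# Hypothetical reference points: model output vs. tape-measured meters.
scale, shift = fit_scale_shift([0.2, 0.5, 0.8], [2.0, 3.5, 5.0])
calibrated = to_metric(np.array([[0.2, 0.8]]), scale, shift)
```

More reference points spread across the depth range give a more trustworthy fit, and the calibration only holds while the camera and scene layout stay roughly the same.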
How Depth Estimation Works
There are three main routes to depth, two of them camera-based and one using active sensors.
Monocular Depth Estimation
Monocular depth estimation predicts depth from a single ordinary camera. There is no geometric trick here: a neural network learns the visual cues humans use (relative size, perspective, occlusion, texture density) from millions of training images and applies those priors to new scenes. Models like Depth Anything 3 generalize remarkably well to scenes they have never seen, with no special hardware and no calibration.
The tradeoff is that monocular depth is relative by default. The model knows the forklift is closer than the racking, but converting that to meters requires a calibration reference.
Stereo Depth Estimation
Stereo depth uses binocular disparity, the same principle as human vision. Two cameras a known distance apart (the baseline) capture the same scene, and depth is computed from the disparity, the horizontal shift of each point between the two views: the larger the shift, the closer the point. Because the baseline is known, stereo produces metric depth out of the box.
Stereo depth cameras like the Luxonis OAK-D, Stereolabs ZED, and Intel RealSense compute depth onboard and output an aligned depth frame alongside the RGB image. The weak spots are textureless surfaces (blank walls, glass), where matching between the two views fails, and long range, where accuracy degrades as disparity shrinks.
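For a rectified stereo pair, the standard relationship is depth = focal length × baseline / disparity, with focal length and disparity in pixels and baseline in meters. The sketch below works through it and also shows why long range is a weak spot: for a fixed matching error in pixels, the depth error grows with the square of the distance. The focal length, baseline, and matching error used here are made-up illustrative values, not the specs of any camera named above.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means no match or infinite range)")
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px: float, baseline_m: float, depth_m: float, disparity_err_px: float) -> float:
    """First-order depth uncertainty: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px

# Illustrative (assumed) rig: 800 px focal length, 7.5 cm baseline.
FOCAL_PX, BASELINE_M = 800.0, 0.075

near = depth_from_disparity(FOCAL_PX, BASELINE_M, 20.0)   # about 3 m
far = depth_from_disparity(FOCAL_PX, BASELINE_M, 5.0)     # about 12 m

# Same quarter-pixel matching error at both ranges.
near_err = depth_error(FOCAL_PX, BASELINE_M, near, 0.25)  # a few centimeters
far_err = depth_error(FOCAL_PX, BASELINE_M, far, 0.25)    # tens of centimeters
```

Quadrupling the distance multiplies the error by sixteen here, which is the arithmetic behind "accuracy degrades as disparity shrinks." A wider baseline or longer focal length pushes the usable range out.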
Active Sensors: Lidar, Time-of-Flight, Structured Light
Active sensors measure depth directly by emitting light and either timing its return (lidar, time-of-flight) or analyzing how a projected pattern deforms on the scene (structured light). Lidar offers long range and works in darkness, which is why it dominates autonomous driving. The costs are price, no color or texture information, and another sensor to integrate. For most industrial applications working at room-to-warehouse distances, camera-based depth has become the practical default, which is why upcoming models like YOLO-StereoDepth are positioned as camera-native alternatives to lidar.
Choosing an Approach
| | Monocular | Stereo | Lidar / active |
|---|---|---|---|
| Depth type | Relative; metric with calibration | Metric from the known baseline | Metric, measured directly |
| Hardware | Any existing RGB camera | Stereo camera, a few hundred dollars | Dedicated sensor, significantly more |
| Strengths | Zero new capex, works on installed cameras | Real units at short-to-mid range, color and depth from one device | Long range, darkness, weather |
| Weak spots | Absolute scale, unusual scenes | Textureless surfaces, long range | Cost, no color, integration |
| Best for | Retrofits, monitoring, proximity ranking | Robotics, grasping, dimensioning | Autonomous vehicles, long-range mapping |
The short version: monocular depth if the cameras are already installed and ranking is enough, stereo if a machine acts on real units, lidar where range and darkness rule cameras out.
What Is Depth Estimation Used For?
Depth turns detections into spatial decisions:
- Safety and proximity: forklift proximity alerts that fire when a person and a vehicle close below a threshold, from cameras already mounted in the warehouse
- Robotics: navigation, obstacle stopping distances, and grasping, where an arm needs the approach distance to a part, not just its bounding box
- Measurement: with distance known, pixel dimensions convert to physical dimensions for sizing packages, produce, and parts in motion
- Scene understanding: separating foreground from background for effects, occlusion handling in AR, and smarter region logic in inspection pipelines
- Crowd and space analytics: person-to-person distances in retail, transit, and event spaces without floor sensors
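The proximity and measurement cases both reduce to the pinhole camera model. Here is a minimal sketch, assuming you already have metric depth and know the camera intrinsics (focal lengths `fx`, `fy` and principal point `cx`, `cy`, all in pixels). The intrinsics, pixel coordinates, and the 2.5 m alert threshold are hypothetical values chosen for illustration.

```python
import math

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) at depth Z -> camera-frame (X, Y, Z) in meters.

    depth_m is distance along the optical axis (Z), the usual depth-map convention.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def distance_m(p, q):
    """Straight-line distance between two 3D points in meters."""
    return math.dist(p, q)

def physical_width_m(pixel_width, depth_m, fx):
    """Real-world width of something spanning pixel_width pixels at depth_m (fronto-parallel)."""
    return pixel_width * depth_m / fx

# Hypothetical intrinsics for a 1280x720 camera.
FX = FY = 1000.0
CX, CY = 640.0, 360.0

# Two detections, both 4 m from the camera, 525 px apart horizontally.
person = backproject(640, 360, 4.0, FX, FY, CX, CY)
forklift = backproject(1165, 360, 4.0, FX, FY, CX, CY)
gap = distance_m(person, forklift)           # about 2.1 m

ALERT_BELOW_M = 2.5                          # assumed safety threshold
too_close = gap < ALERT_BELOW_M

# A package spanning 200 px at 3 m.
package_width = physical_width_m(200, 3.0, FX)   # about 0.6 m
```

With relative depth alone you can still rank who is nearer to the camera, but the distance and size calculations above only mean something once the depth values are in meters.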
How to Use Depth Estimation Today
Depth estimation runs as a building block in Roboflow Workflows through the Depth Estimation block, which runs Depth Anything 3. The most useful pattern fuses it with detection:
- Detect the objects you care about. We recommend RF-DETR, which ships with commercial-safe licensing.
- Run the Depth Estimation block on the same frame to get a per-pixel depth map.
- Sample the depth map at each bounding box to attach a depth value to every detection.
- Calibrate once against a known distance if you need real-world units, then build your logic: alerts, measurements, comparisons.
- Deploy with Inference on video streams, in the cloud, on-prem, or at the edge.
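The sampling step in the list above can be sketched in plain NumPy. This is not the Workflows API: it assumes you already have the depth map as a 2D array and each box as `(x_min, y_min, x_max, y_max)` in pixels, and the function name and `shrink` parameter are ours. It takes the median over the central part of the box, because box edges and corners often cover background rather than the object.

```python
import numpy as np

def sample_box_depth(depth_map, box, shrink=0.5):
    """Median depth over the central region of a bounding box.

    depth_map: 2D array of per-pixel depth
    box:       (x_min, y_min, x_max, y_max) in pixels (assumed format)
    shrink:    fraction of the box width/height to keep around its center
    Returns None if the region holds no valid pixels.
    """
    h, w = depth_map.shape[:2]
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half_w = (x_max - x_min) * shrink / 2
    half_h = (y_max - y_min) * shrink / 2
    # Clip the sampling window to the image bounds.
    x0, x1 = int(max(0, np.floor(cx - half_w))), int(min(w, np.ceil(cx + half_w)))
    y0, y1 = int(max(0, np.floor(cy - half_h))), int(min(h, np.ceil(cy + half_h)))
    patch = depth_map[y0:y1, x0:x1]
    valid = patch[np.isfinite(patch) & (patch > 0)]   # drop holes and invalid readings
    if valid.size == 0:
        return None
    return float(np.median(valid))

# Toy scene: background at 8.0, one object at 3.0 filling its box.
depth_map = np.full((100, 100), 8.0)
depth_map[30:70, 40:60] = 3.0

object_depth = sample_box_depth(depth_map, (40, 30, 60, 70))
off_image = sample_box_depth(depth_map, (200, 200, 220, 220))
```

The median is a deliberate choice over the mean: a few background pixels or sensor holes inside the window will not drag the value. For thin or hollow objects, sampling a segmentation mask instead of a box would likely be more reliable.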
If your application needs metric depth without the calibration step, a stereo depth camera supplies it onboard, and the same Workflow pattern applies: the camera provides the depth frame, the detector provides the objects, and the logic joins them.
Depth Estimation Models
The model landscape is moving quickly. Depth Anything 3 is the current standard for monocular depth and runs in Roboflow Workflows today. The YOLO27 generation adds two depth models in September 2026: YOLO-Depth for monocular depth and YOLO-StereoDepth for stereo. For the full model landscape, see the depth estimation model roundup.
Depth Estimation Conclusion
Depth estimation gives software the third dimension: the distance from the camera to everything it sees. With strong monocular models running on ordinary cameras and stereo hardware supplying metric depth for a few hundred dollars, 3D understanding no longer requires lidar budgets or research teams.
Cite this Post
Use the following entry to cite this post in your research:
Contributing Writer. (Apr 3, 2026). What Is Depth Estimation in Computer Vision?. Roboflow Blog: https://blog.roboflow.com/depth-estimation-in-computer-vision/