A forklift operator pulling a pallet out of a dark trailer does something deceptively hard. They find a worn wooden pallet they can barely see, judge its angle, and slide two forks into the right slot. Computer vision for autonomous mobile robots is the work of teaching a machine to do that same thing, reliably, in warehouses that were never built for robots.
In a recent Roboflow webinar, Vishrut Kaushik, Senior Robotics Engineer at Peer Robotics, walked through how his team builds exactly that: collaborative AMRs that move pallets and trolleys weighing up to 3,000 pounds across live facilities. He is honest about where the problem gets hard, and he shows the loop his team uses to make the robots better every week.
One line from Vishrut captures the whole challenge. As he put it, a human has the intuition to know that a pallet is going to be there even when they cannot see it. The job is building that intuition into a robot.
What Computer Vision for Autonomous Mobile Robots Means
An autonomous mobile robot moves through a space on its own, without fixed tracks or pre-programmed paths. To do that in a warehouse, it has to perceive what is around it: where the floor is clear, where the racks are, and where the specific pallet it was sent to fetch is sitting.
Computer vision for autonomous mobile robots is the perception layer that turns the robot's cameras into that understanding. Peer Robotics builds it around a vision-first platform, the Peer 3000, with six onboard cameras handling everything from navigation to recognizing assets like pallets and trolleys and docking onto them. This is the same category of work covered in Roboflow's guide to vision-guided robotics, applied to heavy material movement.
Most AMRs lean heavily on LiDAR, and Vishrut is clear that LiDAR earns its place. It is excellent at one thing in particular: safety. A safety-rated LiDAR can talk directly to the brakes and stop the robot regardless of what the rest of the compute decides. What LiDAR does not give you is context. It tells the robot there is an obstacle ahead. It does not tell the robot whether that obstacle is a wall, a person, or the exact pallet it came to pick up. That distinction is the reason vision matters.
Why It Matters on the Warehouse Floor
The business case starts with labor. Peer Robotics targets the labor shortage in manufacturing and warehousing, automating the movement of pallets and trolleys with robots that ordinary shop-floor workers can operate by simply pushing them into place. The robot has to fit the operation that already exists, not the other way around.
The harder part is that real operations are messy. Vishrut described a typical e-commerce flow: trucks dock, the robot enters the trailer through the dock plate, unloads pallets, and stages them; later it picks pallets from storage, runs them through a stretch-wrap machine, and loads an outbound trailer. Across that single flow the lighting swings from bright and overexposed near the dock doors to nearly black inside the trailer. The pallets themselves vary from clean wood to worn and dark, and many are wrapped in reflective plastic film that hides the face the robot needs to read.
A perception system that only works in a clean test cell is useless here. The value of computer vision is that it can be trained to hold up across all of those conditions, which is what lets a robot run in a customer facility instead of a lab.
Why Vision Can Finally Handle the Chaos
Two things make this tractable now: the models, and the data loop around them.
On the model side, Peer Robotics runs oriented bounding box detection as its primary model for pallets and trolleys. A standard box is axis-aligned, which is fine for counting but not for docking. An oriented bounding box adds rotation, so the robot recovers the pallet's position and orientation from vision alone, with no LiDAR or 3D data in that step, and with a reported precision close to 2 centimeters. That is enough to recognize a pallet from a distance, approach, and correct on the way in, much the way a person does. Real-time detection that runs accurately on limited onboard compute is exactly what Roboflow's RF-DETR is built for.
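To make the difference between an axis-aligned box and an oriented one concrete, here is a minimal sketch of how a center and heading might be read off the four corners of an oriented box. It is an illustration, not Peer Robotics' pipeline: the function name `obb_pose`, the corner format (four `(x, y)` points in order around the box), and the choice to report the angle of the longer edge are all assumptions. The result is in image coordinates, so turning it into a metric pose would still need camera calibration.

```python
import math


def obb_pose(corners):
    """Return (cx, cy, yaw_deg) for an oriented box given as four corners.

    corners: four (x, y) points listed in order around the box
             (an assumed format; detectors differ in how they report boxes).
    yaw_deg: angle of the box's longer edge, folded into [-90, 90).
    """
    if len(corners) != 4:
        raise ValueError("expected exactly four corner points")

    # The center of a rectangle is the mean of its corners.
    cx = sum(x for x, _ in corners) / 4.0
    cy = sum(y for _, y in corners) / 4.0

    # Two adjacent edges; the longer one defines the heading.
    (x0, y0), (x1, y1), (x2, y2) = corners[0], corners[1], corners[2]
    edge_a = (x1 - x0, y1 - y0)
    edge_b = (x2 - x1, y2 - y1)
    long_edge = max(edge_a, edge_b, key=lambda e: math.hypot(e[0], e[1]))

    yaw = math.degrees(math.atan2(long_edge[1], long_edge[0]))
    # A rectangle looks identical after a 180 degree turn, so fold the angle.
    yaw = (yaw + 90.0) % 180.0 - 90.0
    return cx, cy, yaw
```

An axis-aligned box can only report the center. The rotation is the extra piece of information a docking approach needs. Note that in image coordinates the y axis usually points down, so the sign of the angle is flipped relative to a conventional math plot.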
Beyond detection, Vishrut's team is moving into instance segmentation to give the robot a notion of navigable space. LiDAR might call a patch of floor open when there is actually an uneven surface or a bump a human would steer around. Pixel-level segmentation lets the robot reason about which regions are safe to cross and which classes in the scene are static versus dynamic, which improves mapping and localization in busy environments. Roboflow's RF-DETR-Seg brings the same real-time approach to segmentation tasks like this.
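As a rough illustration of what pixel-level output makes possible, the sketch below treats a segmentation result as a grid of class IDs and asks two questions: is a strip of the image navigable, and how much of what blocks it is static versus dynamic? Everything here is hypothetical: the class IDs, the `NAVIGABLE` and `DYNAMIC` sets, and both function names are made up for the example and do not come from the webinar or from any particular model.

```python
# Hypothetical class IDs for illustration; a real model defines its own.
FLOOR, BUMP, PALLET, PERSON = 0, 1, 2, 3
NAVIGABLE = {FLOOR}   # classes the robot may drive over
DYNAMIC = {PERSON}    # classes expected to move between frames


def corridor_is_clear(mask, col_start, col_end, min_clear_fraction=1.0):
    """Return True if enough of the corridor's pixels are navigable.

    mask: list of rows, each a list of class IDs (one per pixel).
    The corridor is the column range [col_start, col_end) over all rows.
    """
    total = 0
    clear = 0
    for row in mask:
        for class_id in row[col_start:col_end]:
            total += 1
            if class_id in NAVIGABLE:
                clear += 1
    if total == 0:
        raise ValueError("corridor covers no pixels")
    return clear / total >= min_clear_fraction


def split_static_dynamic(mask):
    """Count non-navigable pixels by whether their class is static or dynamic."""
    counts = {"static": 0, "dynamic": 0}
    for row in mask:
        for class_id in row:
            if class_id in NAVIGABLE:
                continue
            key = "dynamic" if class_id in DYNAMIC else "static"
            counts[key] += 1
    return counts
```

The point of the example is the `BUMP` class: a range sensor may report that patch as open floor, while a per-pixel label can mark it as something to steer around.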
The data loop is the other half. The worst cases, like a dark worn pallet inside a trailer, get labeled as hard examples and kept as a test set, so every new model is checked against the scenarios that actually break it.
More importantly, the robot collects its own failures: when inference misses a pallet in the field, a parallel data collection process captures that frame and the customer can flag it, and the example flows back into training.
That is active learning in production, and it is how a model that started rough gets robust. The whole pipeline runs through Roboflow APIs and Roboflow Workflows: failed detections are uploaded and annotated, the model is retrained in the cloud, and the fine-tuned model is pushed back to the robot.
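The two halves of that loop can be sketched in a few lines: deciding which field frames are worth sending back for annotation, and checking a retrained model against the held-out hard examples before it ships. This is a simplified sketch under stated assumptions, not the team's implementation. The `FrameResult` fields, the 0.5 confidence threshold, and the recall-based gate are all hypothetical, and the actual upload, annotation, and training steps (which the webinar says run through Roboflow) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class FrameResult:
    frame_id: str
    pallet_expected: bool          # the task implies a pallet should be visible
    confidences: list              # detector confidence per detection
    customer_flagged: bool = False


def needs_review(result, min_confidence=0.5):
    """Treat a frame as a likely failure worth annotating.

    Hypothetical rule: the customer flagged it, or a pallet was expected
    and no detection reached the confidence threshold.
    """
    if result.customer_flagged:
        return True
    if not result.pallet_expected:
        return False
    if not result.confidences:
        return True
    return max(result.confidences) < min_confidence


def select_for_upload(results, min_confidence=0.5):
    """Return the IDs of frames that should flow back into training."""
    return [r.frame_id for r in results if needs_review(r, min_confidence)]


def passes_hard_set(new_recall, baseline_recall, tolerance=0.0):
    """Gate a retrained model on the hard-example test set.

    The candidate passes only if its recall on the hard examples has not
    dropped more than `tolerance` below the currently deployed model's.
    """
    return new_recall >= baseline_recall - tolerance
```

A usage example, with made-up frames:

```python
frames = [
    FrameResult("dock_001", True, [0.92]),
    FrameResult("trailer_014", True, []),
    FrameResult("trailer_015", True, [0.31]),
]
print(select_for_upload(frames))
```

Keeping the hard-example gate separate from the selection rule means the test set stays fixed while the training set keeps growing.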
Vishrut sees the next step as semantic scene understanding: not just move the pallet you found, but judge whether a pallet should be moved at all, whether a load looks unstable, or whether someone nearby is missing safety gear. That is where vision-language models start to matter, and where he expects the real return for customers to show up.
What the Webinar Shows
The webinar is worth watching because it is an engineer talking plainly about a hard production system, not a polished demo reel.
You see the actual operating environment: a robot entering a trailer and docking onto a stretch-wrapped pallet, and the swing from overexposed dock lighting to a near-black trailer interior. You see a real training image, dark and grainy, that Vishrut calls one of the worst cases his team works with, and he explains why a human still solves it while a model struggles. You get a clear, unhyped account of how LiDAR and cameras divide the work, why 2D LiDAR tends to fall short in unstructured warehouses, and what runs onboard versus offline on the robot's CPU and GPU.
The most useful part for anyone building their own system is the timeline. Vishrut started with six or seven pallets placed around a facility in different orientations and roughly 200 to 300 images he labeled himself, and had a working pallet detection model in about a week. As the data set grew and he leaned on the model to pre-label new images, that cycle compressed to the point where a new customer's images can come in and a fine-tuned model can go back to the robot within a day or two. His closing advice is the kind worth hearing twice: obsess over the problem statement before you write a line of code.
Watch the Computer Vision for AMRs Webinar
Watch the full conversation with Vishrut Kaushik of Peer Robotics on building computer vision for autonomous mobile robots.
If you want to build the kind of perception loop he describes, from labeling hard examples to training and deploying to the edge, you can start on Roboflow.
Cite this Post
Use the following entry to cite this post in your research:
Contributing Writer. (May 6, 2026). Computer Vision for Autonomous Mobile Robots (AMRs). Roboflow Blog: https://blog.roboflow.com/computer-vision-for-autonomous-mobile-robots/