Coding agents are quickly becoming the popular way to build applications, as they generate code, run it, debug errors, and iterate autonomously.
But how well do they perform on tasks related to visual understanding and vision applications? We evaluated the top 3 coding agents (claude code, gemini-cli, openai codex) on 5 different vision tasks, and here's how they performed:

Score is a sum of speed (up to 0.3 if below 1min) and accuracy (up to 0.7, task-specific evaluation).
Quick insights:
- Claude code ended up winning in 4/5 tasks π It also used the most tokens, and tried to verify its results
- Gemini won 1/5 tasks, and got 2 other tasks right, but was slower
- Codex ignored instructions and didn't run the script in 2 cases
Agent Evaluation
All agents were invoked via cli and using the following args:
claude --dangerously-skip-permissions --verbose --output-format stream-json [prompt]codex -a never exec --sandbox workspace-write -c features.search=true [prompt]gemini --yolo --output-format stream-json [prompt]
Task 1 - Count birds on an image with SAM3

Prompt: Use SAM3 to annotate birds in the provided image 'birds_sam3.jpg' (already in your workspace). Count the number of detections and print the count. Return the count as number in 'output.txt' fileResults:
- Gemini / Claude: ended up using
sam3py package, and they both patchedPositionEmbeddingSinelayer to use CPU (ran evals on macOS) - Codex: Ignored instructions to use SAM3 for counting, tried using standard coco-trained detector, miscounted the birds
Winners: Claude / Gemini π
Task 2 - Count cars in a video file
Below is a 10 second clip of video. In total, there are about 150-200 cars, but agent gets some points between 50-100 and 200-300.
Use a python script to count the total number of unique cars in the video 'cars-aerial-view.mp4' (already in your workspace). Output the final number to a file named 'output.txt'.Results:
- Claude made and ran the script and counted 270 cars with botsort tracker
- Gemini 37 counted cars with an COCO-based detector
- Codex counted 48 cars with an COCO-based detector
Winner: Claude π
Task 3 - Count cars in RTSP stream
Similar to the task above, but evaluated on live-stream, where inference speed matters. We used mediamtx to serve the video (in a loop) via RTSP. I specifically mention it needs >10 fps, otherwise trackers wouldn't be able to consistently track detected cars.
Prompt: I have an RTSP stream at rtsp://localhost:8554/stream. I want to count the total number of cars in 15 seconds of the video using a python script. Also measure infernece FPS, if it's below 10fps use smaller/faster model. Print that out to the console at the end. Output the final number of unique cars in a file named 'output.txt'! The video is an aerial/top-down view of a vehicle parking lot.Results:
- Claude used a COCO detector and made a simple IoU-based tracker, counted 86 cars
- Gemini tried different models sizes with an off-the-shelf tracker, but only counted 29 cars. Likely the confidence threshold was set too high
- Codex just generated a Py script without actually following instructions and running it π
Winner: Claude π
Task 4 - Count Avocados in an image
This is quite a similar task to the first one, but without specifying which tool (sam3) to use. Agents get points between 25-100, more points the closer it is to 55.

Results:
- Claude used YOLO World with different confidence thresholds, annotated avocados (drew bounding boxes), then used LLM to verify annotated images. Consumed a bunch of tokens and counted 54 avocados
- Gemini used similar approach, but timed out after 10 minutes βοΈ
- Codex used similar approach, counted 43 avocados (no verification). Because it didn't do any verification it was faster, and scored higher overall score.
Winner: Claude/Codex π
Task 5 - License Plate Recognition (LPR)
Here, agents got points for outputting all 3 plate images into the folder, and points for actually doing LPR.

Prompt: Write a Python script that detects and reads license plates from the provided image 'cars-highway.png' in your workspace. The script should:
1. detect vehicles and license plates
2. crop each detected license plate and save the crops to a 'results/' folder as image files
3. run OCR/LPR on each plate to extract the text (no whitespaces or special characters)
4. save all plate numbers to a 'result.txt' file, comma separated, no quotes.Results:
- Claude used a detector trained on COCO for car detection, tried using classical CV approaches (thresholding/contour) for plate detection, and
easyocr/pytessaractfor recognition. Timed out βοΈ - Gemini used a COCO-based detector for car detection,
yasirfaizahmed/license-plate-object-detectionmodel from HF Hub for plate detection, andeasyocrfor LPR. Got 2/3 plates correct π - Codex just wrote a script without running it (ignored instructions).
Winner: Gemini π
What's next?
In this benchmark, Claude Code performed best overall, solving 4 out of 5 tasks successfully. Gemini solved several tasks but was slower, while Codex struggled to follow instructions in this setup.
Weβll continue publishing experiments like this as coding agents evolve.
Cite this Post
Use the following entry to cite this post in your research:
Erik Kokalj. (Mar 16, 2026). Which is the Best Coding Agent for Vision tasks?. Roboflow Blog: https://blog.roboflow.com/best-coding-agent-for-vision-ai/