Based on Roboflow Vision Evals, Claude Sonnet 5 represents a lateral move for computer vision, tying with its predecessor on benchmark accuracy while struggling with document understanding and object counting. It is outperformed in both vision capabilities and cost-efficiency by alternatives like Gemini 3.5 Flash.
Anthropic released Claude Sonnet 5 on June 30, 2026, the new mid-tier model in the Claude 5 family (launch post). Coding and speed got the headlines. The launch skipped the question we care about: how well does it perform on vision tasks?
We ran it through the Roboflow Vision Evals: 67 real vision prompts spanning object understanding, spatial understanding, document understanding, defect detection, and counting.
Claude Sonnet 5: Vision Performance
For computer vision, Sonnet 5 is a lateral move. It passes 47 of 67 prompts (70%), dead even with Sonnet 4.6. Fable 5 clears it at 75%, and Gemini 3.5 Flash still leads at 79%.
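The percentages in this post are plain pass rates: prompts passed divided by prompts attempted. As a minimal sketch of that arithmetic, the snippet below reproduces two figures quoted in this post. The helper name `pass_rate` is hypothetical and is not part of the Roboflow eval tooling.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the share of prompts passed, as a percentage."""
    if total <= 0:
        raise ValueError("total must be a positive number of prompts")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return 100.0 * passed / total


# Overall result quoted in this post: 47 of 67 prompts.
overall = pass_rate(47, 67)
print(f"Sonnet 5 overall: {overall:.1f}%")

# Object counting result quoted in this post: 2 of 10 prompts.
counting = pass_rate(2, 10)
print(f"Sonnet 5 counting: {counting:.1f}%")
```

With only 67 prompts, one extra pass or fail moves the overall score by about 1.5 points, so small gaps between models are worth reading with some caution.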

Where Claude Sonnet 5 Slips
The tie with Sonnet 4.6 hides a trade. Sonnet 5 gains on object understanding (93% versus 71%) but gives it back elsewhere. Document understanding drops to 67%, the weakest of the four models, and object counting falls to 20% (2 of 10), the lowest score in the group. Counts hold up when objects are few and well separated, then fall apart under clutter and occlusion. That is a weakness every VLM shares, and Sonnet 5 sits at the bottom of this set.
Pricing and Price-Performance for Claude Sonnet 5
Sonnet 5 launched with introductory pricing that runs through August 31, 2026, after which it rises to the standard Sonnet tier. Here is how the four models line up on vision score against token price:
| Model | Vision score | Input | Output |
|---|---|---|---|
| Gemini 3.5 Flash | 79% | $1.50 / MTok | $9 / MTok |
| Claude Fable 5 | 75% | $10 / MTok | $50 / MTok |
| Claude Sonnet 5 | 70% | $3 / MTok* | $15 / MTok* |
| Claude Sonnet 4.6 | 70% | $3 / MTok | $15 / MTok |
* Introductory pricing of $2 / MTok input and $10 / MTok output applies through August 31, 2026.
The model that scores nine points higher on vision than Sonnet 5 also costs the least per token. Sonnet 5 trails Gemini 3.5 Flash on accuracy and costs more per token, even at introductory pricing, and at standard pricing it matches Sonnet 4.6 on both. Measured on vision accuracy per dollar, it is not the value pick at its tier.
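One rough way to put a number on "vision accuracy per dollar" is to divide each vision score by a blended token price. The sketch below uses the scores and prices from the table, plus one loud assumption: a workload where 75% of billed tokens are input and 25% are output. That split, the `ModelRow` type, and the function names are illustrative only. Real costs depend on how each provider bills image input and on how long your responses run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRow:
    name: str
    vision_score: float  # percent of eval prompts passed
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens


# Scores and prices copied from the table in this post.
ROWS = [
    ModelRow("Gemini 3.5 Flash", 79, 1.50, 9),
    ModelRow("Claude Fable 5", 75, 10, 50),
    ModelRow("Claude Sonnet 5 (standard)", 70, 3, 15),
    ModelRow("Claude Sonnet 5 (introductory)", 70, 2, 10),
    ModelRow("Claude Sonnet 4.6", 70, 3, 15),
]


def blended_price(row: ModelRow, input_share: float = 0.75) -> float:
    """Weighted USD price per million tokens.

    input_share is an ASSUMED fraction of billed tokens that are input;
    it is not a figure from the benchmark.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * row.input_price + (1.0 - input_share) * row.output_price


def score_per_dollar(row: ModelRow, input_share: float = 0.75) -> float:
    """Vision score points per blended dollar per million tokens."""
    return row.vision_score / blended_price(row, input_share)


# Highest value first under the assumed token mix.
RANKED = sorted(ROWS, key=score_per_dollar, reverse=True)

for row in RANKED:
    print(f"{row.name:32s} {score_per_dollar(row):6.2f} points per blended $/MTok")
```

Gemini 3.5 Flash has both the highest score and the lowest input and output prices, so it ranks first under any input share. The ordering of the remaining models is more sensitive to the assumed mix, so swap in your own ratio before drawing conclusions.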
When to Use Claude Sonnet 5
Sonnet 5 is a capable reasoning model for visual question answering and general image understanding. If you already run Sonnet for text and want to add image input, it will describe a scene and answer questions about it well. It is not a step forward for vision over Sonnet 4.6, and it is not the strongest option at its tier.
For detection, counting, and segmentation, skip the VLM. A fine-tuned RF-DETR model beats any frontier VLM on accuracy, at a fraction of the cost and latency, and you can chain the two in Roboflow Workflows. If you want the best vision accuracy per dollar today, Gemini 3.5 Flash still leads.
Compare Sonnet 5 against every model we have benchmarked, and test it on your own images, at Roboflow Playground Evals.
Cite this Post
Use the following entry to cite this post in your research:
Erik Kokalj. (Jul 3, 2026). Claude Sonnet 5 for Vision: Evaluation and Benchmarks. Roboflow Blog: https://blog.roboflow.com/claude-sonnet-5-for-vision/