What does it cost to process an image with a vision model?

Published May 4, 2026 • 6 min read

A reproducible breakdown of GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro. Updated May 2026.

Why VLM pricing is harder than LLM pricing

Estimating the cost of an LLM call is mostly arithmetic. Count the input tokens, count the output tokens, multiply by the rate card, done. Vision-language models break that habit. The same JPEG can become 87 tokens on one provider and 6,636 on another, before the model has generated a single word of output. If you are sizing a workload, the question of how much it costs to process an image only has an answer once you specify the image, the provider, and what you want back.

This piece walks through the cost equation, the per-provider tokenization rules as of May 2026, and a worked grid across five image sizes. The goal is to give you something you can plug your own numbers into.

The VLM cost equation

Cost per image = (image input tokens + text input tokens) × input price + output tokens × output price

Three of those four terms behave like normal LLM math. The fourth, image input tokens, is where the providers diverge. The rest of this post focuses there, because that is the hardest part when making a budget.

For the comparisons below, we hold text input and output constant (a 100-token instruction, a 500-token JSON response) and vary the image. That isolates the variable that vision pricing actually depends on.

How each provider turns pixels into tokens

OpenAI GPT-5.5

GPT-5.5 uses patch-based image tokenization. Images are covered by 32 by 32 pixel patches, and the image token count is based on the number of patches after any model resizing. In `high` detail mode, GPT-5.5 allows up to 2,500 patches or a 2,048-pixel maximum dimension. If either limit is exceeded, the image is resized while preserving aspect ratio.

In `original` detail mode, GPT-5.5 allows up to 10,000 patches or a 6,000-pixel maximum dimension. One important gotcha: on GPT-5.5, omitted `detail` and `auto` behave like `original`, not `high`. For the comparison grid below, we use `detail: "high"`.

Input price: $5.00 per million tokens for GPT-5.5 standard input.

Anthropic Claude Opus 4.7

Anthropic uses an area-based formula. Image tokens approximate (width × height) / 750. The long edge is capped at 2,576 pixels in Opus 4.7, up from 1,568 in prior Claude models. Anything larger gets resized down before tokenization.

There is one wrinkle worth knowing about. Opus 4.7 ships with a new tokenizer that produces 1.0x to 1.35x more tokens for the same input compared to Opus 4.6. Image tokens are affected too, so a phone photo that cost X on Opus 4.6 can cost noticeably more on Opus 4.7 even at the same nominal price per token.

Input price: $5.00 per million tokens.

Google Gemini 3.1 Pro

Gemini has the simplest rule. Images where both dimensions are 384 pixels or smaller cost a flat 258 tokens. Anything larger is cropped and scaled as needed into 768 by 768 tiles, and each tile costs 258 tokens.

Input price: $2.00 per million tokens (standard context). The lower per-token price partially offsets the higher tile count on big images.

VLM pricing comparison grid

Five representative image sizes, run through each provider's rule. Image input tokens only.

Image	GPT-5.5 tokens	Claude tokens	Gemini tokens
Thumbnail (256×256)	64	87	258
Web image (1024×1024)	1,024	1,398	1,032
Phone photo (4032×3024)	2,451	6,636	6,192
Document scan (2480×3508)	2,478	6,255	5,160
4K video frame (3840×2160)	2,304	4,977	3,870

Translating to dollars at current input prices:

Image	GPT-5.5	Claude Opus 4.7	Gemini 3.1 Pro
Thumbnail (256×256)	$0.00032	$0.00044	$0.00052
Web image (1024×1024)	$0.00512	$0.00699	$0.00206
Phone photo (4032×3024)	$0.01226	$0.03318	$0.01238
Document scan (2480×3508)	$0.01239	$0.03127	$0.01032
4K video frame (3840×2160)	$0.01152	$0.02489	$0.00774

The same grid at one million images, to give you an idea for real world applications like the volume of an inspection line, content moderation pipeline, or document processing:

Image	GPT-5.5	Claude Opus 4.7	Gemini 3.1 Pro
Thumbnail (256×256)	$320	$435	$516
Web image (1024×1024)	$5,120	$6,990	$2,064
Phone photo (4032×3024)	$12,255	$33,180	$12,384
Document scan (2480×3508)	$12,390	$31,275	$10,320
4K video frame (3840×2160)	$11,520	$24,885	$7,740

These numbers are image-input only. Add 100 input tokens for the instruction and 500 output tokens for a JSON response and the total per call goes up by roughly $0.0130 on Claude, $0.0155 on GPT-5.5, and $0.0062 on Gemini, depending on output rates. For binary classification (one-token outputs), output cost is negligible. For long-form analysis (2,000+ output tokens), output cost can dominate the image cost entirely.

Key takeaways from comparing VLMs

A few things that matter when you turn this into a budget.

The same image can produce very different token counts across providers. A phone photo is about 2,451 image tokens on GPT-5.5, 6,636 on Claude, and 6,192 on Gemini. That is a 2.7x spread between GPT-5.5 and Claude before output tokens.

Those differences come from tokenization rules, not just price. GPT-5.5 uses patch-based accounting with a patch budget in `high` detail mode. Claude uses an area-based formula after resizing. Gemini uses fixed-cost image tiles.

GPT-5.5 is capped in `high` detail mode, so large images tend to cluster in the low thousands of tokens rather than growing indefinitely. If you use `original` or leave `detail` on default/`auto`, GPT-5.5 token counts can be much higher.

The cheapest provider depends on the image. Claude wins on tiny images. Gemini wins on several medium and large rows. GPT-5.5 is competitive on large natural images and much cheaper than Claude there.

Output tokens can change the ranking. This grid is image-input only; long JSON responses or detailed reports can dominate total cost.

Generality becomes a tax at production scale

Frontier VLMs are the right tool when you need general reasoning over an image, when prompt iteration matters more than per-call cost, or when volumes are low enough that an extra cent per image is invisible. A few thousand calls a day, a few cents each, is fine.

The math changes at scale. A factory inspection line running at 30 frames per second on three cameras is 7.8 million images a day. At about $0.002 per image, roughly the cheapest web-resolution cell in the grid above, that is $15,600 per day, every day, for one line. Add output tokens, retries, and a redundant model for cross-checking, and the number doubles.

At that volume, generality becomes a tax. Most production vision workloads do not need a model that can also write poetry; they need a model that runs a specific task fast and cheap on specific hardware.

This is the gap that purpose-built vision models fill. A fine-tuned RF-DETR running on an edge GPU can do object detection at sub-millisecond latency for a fraction of a cent per frame, and it does not pay for tokens at all. Roboflow exists because at production scale the right answer is usually not an API call to a frontier VLM. It is a smaller, specialized model trained on your data and deployed where the cameras actually are.

The frontier VLMs still have a role in that pipeline. They are useful for bootstrapping labels, handling the long tail of edge cases, and debugging failure modes. The point is not to pick one tool. It is to know where each tool earns its keep, which starts with knowing what each one actually costs.

VLM cost calculator

The formulas are stable enough to spreadsheet. If you want to plug in your own image distribution and instruction lengths, the rules above are everything you need. The python below reproduces the numbers in this post.

import math

def gpt55_tokens(w, h, detail="high"):
"""
Approximate GPT-5.5 image input tokens.

GPT-5.5 uses 32x32 patch-based image tokenization.
For cost-controlled workloads, explicitly set detail="high";
GPT-5.5 default/auto behaves like "original".
"""
if detail == "low":
# Low detail receives a 512x512 version of the image.
# 512 / 32 = 16 patches per side.
return 16 * 16

if detail == "high":
patch_budget = 2500
max_dim = 2048
elif detail in ("original", "auto"):
patch_budget = 10000
max_dim = 6000
else:
raise ValueError("detail must be 'low', 'high', 'original', or 'auto'")

# Constraint 1: maximum dimension.
dim_scale = min(1.0, max_dim / max(w, h))

# Constraint 2: patch budget.
original_patches = math.ceil(w / 32) * math.ceil(h / 32)

if original_patches <= patch_budget:
patch_scale = 1.0
else:
shrink = math.sqrt((32**2 * patch_budget) / (w * h))

# OpenAI's docs describe an adjustment so the integer resized dimensions
# remain within the patch budget after 32px patch rounding.
patch_scale = shrink * min(
math.floor(w * shrink / 32) / (w * shrink / 32),
math.floor(h * shrink / 32) / (h * shrink / 32),
)

scale = min(dim_scale, patch_scale)

resized_w = math.floor(w * scale)
resized_h = math.floor(h * scale)

return math.ceil(resized_w / 32) * math.ceil(resized_h / 32)

def claude_tokens(w, h, max_long=2576):
    if max(w, h) > max_long:
        s = max_long / max(w, h); w, h = w*s, h*s
    return round(w * h / 750)

def gemini_tokens(w, h):
    if w <= 384 and h <= 384:
        return 258
    return 258 * math.ceil(w/768) * math.ceil(h/768)

Sources for Vision Token Counts

OpenAI vision and pricing: Images and vision guide, API pricing.

Anthropic Claude vision and pricing: Vision docs, pricing, Opus 4.7 announcement.

Google Gemini image understanding and pricing: Image understanding, tokens guide, pricing.

Cite this Post

Use the following entry to cite this post in your research:

Trevor Lynn. (May 4, 2026). Vision Token Counts: What does it cost to process an image with a frontier vision model?. Roboflow Blog: https://blog.roboflow.com/image-token-cost-vlm/

Stay Connected

Get the Latest in Computer Vision First

Model Playground

Compare VLM Models Side-by-Side

Written by

Trevor Lynn

Trevor leads Marketing at Roboflow. He focuses on sharing insights from Roboflow customers to inspire the broader AI community and help advance visual AI.

View more posts

Vision Token Counts: What does it cost to process an image with a frontier vision model?