
On April 16th, 2025, OpenAI released two new multimodal models: o3 and o4-mini. These models are part of the “reasoning” series of models, which started with o1 last year. o3 and o4-mini, in contrast to previous reasoning models, now “can integrate images directly into their chain of thought.” This means you can provide images as context in a prompt.
We ran a qualitative analysis of both o3 and o4-mini using our standard set of prompts to evaluate the general capabilities of the model. Here are our results:
In summary:
- o4-mini passed four of seven tests. It almost passed the OCR test, but failed to add italics where relevant. It almost passed the object counting test: its answer was incorrect on the first run and correct on the second.
- o3 passed three of seven tests.
Surprisingly, both models failed at object counting, a task that most other multimodal models we have tested handle well.
In this guide, we will walk through the prompts we ran, the answers from the models, and our analysis of each model’s performance.
Without further ado, let’s get started!
What are OpenAI o3 and o4-mini?
o3 and o4-mini are the latest reasoning models released by OpenAI. The models, available via the OpenAI API and Playground as of April 16th, 2025, allow you to ask questions and upload images as context. The models will then use their reasoning abilities to return an answer.
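For example, here is a minimal sketch of how you might send an image prompt to o4-mini with the OpenAI Python SDK. Note that we ran our tests in the playground rather than via code, so the image path, prompt, and response handling below are illustrative placeholders:

```python
# A minimal sketch, assuming the OpenAI Python SDK and the Chat Completions
# image input format; the image path and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local image as base64 so it can be passed inline with the prompt.
with open("coins.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many coins do I have?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```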
In OpenAI’s model playground, o3 is described as OpenAI’s “most powerful reasoning model with leading performance on coding, math, science, and vision.” o4-mini is described as their “faster, cost-efficient reasoning model delivering strong performance on math, coding, and vision.”
OpenAI notes that both o3 and o4-mini use a “chain of thought” mechanism, in which the model asks itself questions to find an answer to a given prompt. This is especially useful for questions that require more analysis.
OpenAI found o3 achieves the best performance when compared to their previous models on visual math reasoning, scientific figure reasoning, scientific figure description, college-level visual problem-solving, visual search, and visual perception primitives. The full results are described in OpenAI’s “Thinking with images” blog post released with the models.
Evaluating OpenAI’s o3 and o4-mini models
We evaluated OpenAI’s o3 and o4-mini multimodal support on several tasks, including:
- Object counting
- Visual question answering (VQA)
- Document OCR
- Document VQA
- Real-world OCR
- Zero-shot object detection
Below, we show our results from each test.
Our tests were run using the ChatGPT playground.
o3 and o4-mini Object Counting
We started by asking o3 and o4-mini to count the number of coins in an image. We asked the model “How many coins do I have?” with the following image:
o4-mini first returned:
You have five coins in the picture.
When prompted again, the model returned:
There are four coins in the picture.
o4-mini at first failed to answer the question correctly. When given exactly the same prompt again, it answered correctly. This variability is notable: in production applications, inconsistent answers to identical prompts can be detrimental.
o3 failed this test, answering:
You have five coins.
These are the first multimodal models we have tested in several months that have failed the counting task.
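Given the run-to-run variability we saw with o4-mini, one simple check is to send the same prompt several times and tally the answers. Below is a sketch of that idea; `ask` is a hypothetical helper that wraps an image prompt call like the one sketched earlier and returns the model’s reply as a string:

```python
# A sketch of checking answer consistency by repeating the same prompt.
# `ask` is a hypothetical helper: it should send a text + image prompt to the
# model and return the reply as a string (see the API sketch above).
from collections import Counter
from typing import Callable


def check_consistency(
    ask: Callable[[str, str], str], prompt: str, image_path: str, runs: int = 5
) -> Counter:
    """Run the same prompt several times and tally the distinct answers."""
    answers = [ask(prompt, image_path) for _ in range(runs)]
    return Counter(answers)


# Example (hypothetical output): Counter({'four coins': 3, 'five coins': 2})
```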
o3 and o4-mini Visual Question Answering
We next tested the models’ visual question answering abilities. We asked the question “Which movie is this scene from?” with the following image as context:
o4-mini returned:
This is from the movie “Home Alone.”
o3 returned:
This scene is from the movie “Home Alone.”
Both models answered the question correctly.
o3 and o4-mini Document OCR
We then tested o3 and o4-mini’s document OCR abilities. We uploaded a screenshot with several lines of text and asked the model to “Read text from image.” We provided the following image:
o4-mini returned:
I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift’s Eras. Meta indeed. I started listening to Ms. Swift’s music after hearing the Midnights album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.
o3 returned:
I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the Midnights album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.
The words are all correct. With that said, both models missed the italics on the word “Midnights”. We have seen this behaviour before, where a model successfully reads the text but misses the italic formatting.
GPT-4.1 nano – a new GPT model released earlier this week – got the answer exactly right, including the italics.
o3 and o4-mini Document Question Answering
We then uploaded a photo of a menu and asked “How much does Pastrami Pizza cost?”:
o4-mini answered:
The Pastrami Pizza is priced at $27.
o3 answered:
The Pastrami Pizza costs $27.
Both models were correct.
We then asked the model to answer a question about the contents of a receipt. Given the following image of a receipt, we asked “How much tax did I pay?”:
o4-mini responded:
You paid $2.30 in tax.
o3 responded:
You paid $0.30 in tax.
o4-mini was correct. o3 was incorrect. Of note, o3-mini also got this question wrong, but GPT-4.5 got it right. This hints that reasoning models may not be the best fit for some use cases.
o3 and o4-mini Real-World OCR
We then asked o3 and o4-mini to do a real world OCR task. We provided a photo of a tire with a serial number and asked the model to read the serial number. Here is the image we provided:
We provided the prompt “Read the serial number.”
o4-mini returned:
The serial number reads: 3702692432.
o3 returned:
The serial number shown is: 3702692432
Both models were correct.
o3 and o4-mini Object Detection
Finally, we challenged both models to detect the location of an object in an image. This is a task with which many multimodal models struggle. We provided the following image and asked each model to “Return the xyxy coordinates of the dog in the image.”
o4-mini returned:
Here’s an approximate bounding box for the dog in (x_min, y_min, x_max, y_max) format:
[200, 400, 800, 1400]
We plotted these coordinates on the input image to visualise them using the supervision Python package. The results are:
We are counting this as a fail because, while the dog is captured, the bounding box contains a lot of background and misses part of the dog.
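For reference, here is a minimal sketch of how coordinates like these can be drawn on an image with the supervision package; the image path is a placeholder:

```python
# A minimal sketch of plotting a model-returned xyxy bounding box with
# the supervision package; "dog.jpg" is a placeholder image path.
import cv2
import numpy as np
import supervision as sv

image = cv2.imread("dog.jpg")

# Bounding box in (x_min, y_min, x_max, y_max) format, as returned by o4-mini.
detections = sv.Detections(xyxy=np.array([[200, 400, 800, 1400]], dtype=float))

annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
sv.plot_image(annotated)
```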
o3 returned:
0 300 450 1000
When plotted on an image, the results are:
We are counting this as a fail for o3, too.
Conclusion
o3 and o4-mini are multimodal reasoning models developed by OpenAI. These models use chain-of-thought processing to develop more reasoned answers to a question.
o4-mini passed four of seven tests. o3 passed three of seven tests. The performance of both reasoning models was worse than GPT-4.1 and GPT-4.1 Nano, OpenAI’s latest multimodal models. GPT-4.1 and GPT-4.1 Nano do not have reasoning capabilities.
Curious to learn more about state-of-the-art multimodal models? Check out our multimodal model guides.
Cite this Post
Use the following entry to cite this post in your research:
James Gallagher. (Apr 17, 2025). OpenAI o3 and o4-mini: Multimodal and Vision Analysis. Roboflow Blog: https://blog.roboflow.com/openai-o3-and-o4-mini/