How to Use Gemini for OCR

Published Feb 4, 2025 • 5 min read

Vision AI models, including many in Google’s Gemini series, are at the frontier of optical character recognition (OCR).

With Gemini Flash, for example, you can upload an image and retrieve all of the text in the image. This works for screenshots, handwriting, photos of documents, and more.

In this guide, we are going to walk through how to build an AI workflow that uses Google’s Gemini model for OCR.

By the end of this guide, we will be able to read the contents of the following receipt with Gemini:

Without further ado, let’s get started!

You can now try Gemini for free, no login required, with the Model Playground.

Prerequisites

To follow this guide, you will need:

Step #1: Create a Workflow

For this guide, we are going to use Roboflow Workflows, a web-based application builder for visual AI tasks.

With Workflows, you can chain together multiple tasks – from identifying objects in images with state-of-the-art detection models to asking questions with visual language models – to build multi-step applications.

Open the Roboflow dashboard, then click “Workflows” in the left sidebar. Then, create a new Workflow.

You will be taken to a blank Workflow editor:

Step #2: Add a Multimodal Model Block

Workflows supports many state-of-the-art vision AI models for use in reading text in images. For this guide, we will use Google’s Gemini series.

Click “Add Block” in the Workflows editor, then search for Gemini:

A configuration window will appear in which you can set up a prompt for Gemini.

For this guide, we are going to use the Structured Output Generation method. This means that we will describe the fields we want to the model, and our results will be returned as JSON.

Here is what your Gemini configuration should look like:

Structured Output Generation lets you provide a JSON structure that Gemini will use to form a response. Let’s use the following structure:

{
  "location": "",
  "time": "",
  "date": "",
  "transactions": "",
  "total_cost": "",
  "business_name": ""
}

This will be sent to Gemini when our Workflow runs to say exactly what information we want to retrieve, and in what structure.
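If you keep this structure alongside your application code, one way to confirm it is valid JSON before pasting it into the block is to build it as a Python dictionary and serialize it. This is a minimal sketch; the name RECEIPT_SCHEMA is our own and is not part of Workflows.

```python
import json

# Keys mirror the structure above; the empty strings are placeholders.
RECEIPT_SCHEMA = {
    "location": "",
    "time": "",
    "date": "",
    "transactions": "",
    "total_cost": "",
    "business_name": "",
}

# json.dumps always emits strictly valid JSON (for example, no trailing commas).
schema_json = json.dumps(RECEIPT_SCHEMA, indent=2)
print(schema_json)
```

You can then copy the printed text into the Gemini block configuration.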

You can choose which Gemini model to use from the Model Version dropdown:

Once you have configured the multimodal model you are using, click “Save”.

Step #3: Build Custom Logic

With our receipt reading system ready, we can start to build custom logic that uses the results from the OCR step. For example, we can send a notification to Slack with the results from our system.

For this step, we are going to add three blocks to our Workflow:

Property Definition, which we will use to take our input image and convert it into a JPEG that can be sent to Slack;
JSON Parser, which reads our Gemini output and turns it into JSON that our Workflow can understand; and
Slack Notification, which sends a notification to Slack.

Our Workflow will look like this:

Let's talk through each block step-by-step.

Property Definition

Add a new Property Definition block to your Workflow, then configure it to accept the input image and process it with the "Image to JPEG" operation:

JSON Parser

Add a new JSON Parser block to your Workflow. Set the Expected Fields value to the keys from the Gemini block that you want to use in your Workflow.

For this guide, we'll parse the location, time, date, total_cost, and business_name returned from the Gemini block.
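To make the idea concrete, here is a rough Python sketch of what a parsing step like this does conceptually. It is not the block's actual implementation. The function name parse_receipt and the sample response are made up for illustration.

```python
import json

EXPECTED_FIELDS = ["location", "time", "date", "total_cost", "business_name"]


def parse_receipt(raw_output, expected_fields=EXPECTED_FIELDS):
    """Pull the expected fields out of a model response that contains JSON."""
    empty = {field: None for field in expected_fields}
    # The model may surround the JSON with extra text, so keep only the
    # span from the first "{" to the last "}".
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end < start:
        return empty
    try:
        data = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    # Missing keys come back as None instead of raising an error.
    return {field: data.get(field) for field in expected_fields}


# Made-up model response, for illustration only.
raw = 'Result: {"date": "2025-02-04", "total_cost": "$12.50", "business_name": "Example Cafe"}'
parsed = parse_receipt(raw)
print(parsed)
```

Returning None for missing fields, rather than failing, keeps later steps simple: they can check each value before using it.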

Slack Notification

Next, add a Slack Notification block.

First, you will need to configure the block with:

A Slack token with permission to write to Slack.
The channel ID to which you want to send messages.

You can read how to find these values in our How to Send a Slack Notification with Roboflow Workflows guide.

Next, set the message content to:

A receipt dated {{ $parameters.date }} from {{ $parameters.business_name }} was received. {{ $parameters.total_cost }} was spent.

This is the template for the message to send to Slack.

Then, click "Add Property" and add properties for each value you want to be readable in the message, like this:

Here, we create three properties:

business_name
total_cost
date

These are accessible with the {{ $parameters.business_name }}, {{ $parameters.total_cost }} and {{ $parameters.date }} variables in our Slack message template.
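If it helps to see how placeholders of this form get filled in, the following Python sketch mimics the substitution with a regular expression. It only illustrates the idea and is not how the Slack Notification block is implemented. The render function and the sample values are hypothetical.

```python
import re

TEMPLATE = (
    "A receipt dated {{ $parameters.date }} from "
    "{{ $parameters.business_name }} was received. "
    "{{ $parameters.total_cost }} was spent."
)

# Matches placeholders such as {{ $parameters.date }} and captures the name.
PLACEHOLDER = re.compile(r"\{\{\s*\$parameters\.(\w+)\s*\}\}")


def render(template, parameters):
    """Replace each {{ $parameters.name }} placeholder with its value."""
    # Unknown names are left untouched so a missing property is easy to spot.
    return PLACEHOLDER.sub(
        lambda m: str(parameters.get(m.group(1), m.group(0))), template
    )


# Made-up values, for illustration only.
message = render(
    TEMPLATE,
    {"date": "2025-02-04", "business_name": "Example Cafe", "total_cost": "$12.50"},
)
print(message)
```

A placeholder with no matching property stays in the output as written, which is a quick signal that a property was not added to the block.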

Finally, set the Attachments value to:

{
  "image.jpg": "$steps.property_definition.output"
}

Step #4: Test the Workflow

We are now ready to test our AI receipt reading application.

Click “Test Workflow” in the top right corner of the Workflows application, then drag and drop an image that you want to use:

Click “Run” to run the Workflow.

Our Workflow returns all of the information we requested:

Gemini successfully ran OCR on our receipt and returned text from the image.

We also received a Slack notification with our receipt information:

So far, we have tested our application in the browser, but you can call your Workflow from anywhere. Note that since this Workflow depends on Gemini, you will need an internet-connected device to run it.

Click “Deploy” at the top of the Workflows editor to see code snippets that show how to call a cloud API using your Workflow or deploy your Workflow on your own system.

Here is an example that shows how to call a Workflow from the Roboflow Cloud:


from inference_sdk import InferenceHTTPClient

# Replace API_KEY with your Roboflow API key.
client = InferenceHTTPClient(
    api_url="https://detect.roboflow.com",
    api_key="API_KEY"
)

# Replace the workspace name, Workflow ID, and image path with your own values.
result = client.run_workflow(
    workspace_name="WORKSPACE-NAME",
    workflow_id="WORKFLOW-ID",
    images={
        "image": "YOUR_IMAGE.jpg"
    },
    use_cache=True  # cache workflow definition for 15 minutes
)
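Once you have a result, you will usually want to pull out the parsed fields. The sketch below assumes the response is a list with one dictionary per input image, and that your Workflow exposes the JSON Parser output under the name json_parser. Both are assumptions: the output names depend on how you configured your Workflow, so print the result first to check its shape. The summarize_receipt helper and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def summarize_receipt(result: List[Dict[str, Any]], output_name: str = "json_parser") -> str:
    """Build a one-line summary from the first image's Workflow outputs."""
    if not result:
        return "No results returned."
    # output_name is an assumed output key; adjust it to match your Workflow.
    fields = result[0].get(output_name) or {}
    date = fields.get("date") or "unknown date"
    total = fields.get("total_cost") or "unknown amount"
    business = fields.get("business_name") or "unknown business"
    return f"{date}: {total} spent at {business}"


# Made-up response in the assumed shape, for illustration only.
sample_result = [
    {"json_parser": {"date": "2025-02-04", "total_cost": "$12.50", "business_name": "Example Cafe"}}
]
print(summarize_receipt(sample_result))
```

In your own code, you would pass the result returned by run_workflow in place of sample_result.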

Conclusion

With models like Gemini, you can run OCR on images. In this guide, we showed how to use Roboflow Workflows, a web-based application builder, to create a receipt reading application that uses Google’s Gemini series.

You could extend the example in this guide to do more. For example, you could change your prompt to extract specific information from a receipt, such as the total cost of a transaction.

To learn more about building with Roboflow Workflows, check out the Workflows launch guide.

Cite this Post

Use the following entry to cite this post in your research:

James Gallagher. (Feb 4, 2025). How to Use Gemini for OCR. Roboflow Blog: https://blog.roboflow.com/how-to-use-gemini-for-ocr/

Written by

James Gallagher

James is a technical writer at Roboflow, with experience writing documentation on how to train and use state-of-the-art computer vision models.
