Document Understanding with Multimodal Models

Published Jul 12, 2024 • 3 min read

Multimodal vision models allow you to ask a question about the contents of an image. For example, you can provide an image of a receipt or invoice to a multimodal model and ask for information about how much different products cost.

PaliGemma, a multimodal vision model architecture developed by Google, was released with several models that can be used out of the box for multimodal tasks. One of these models was fine-tuned specifically for document understanding and question answering, which should improve its performance on those tasks compared with more general models.

Unlike hosted models such as GPT-4 with Vision and Claude 3, PaliGemma can run on your own hardware.

Consider the following image of a receipt:

When given the prompt “Was the invoice paid?”, PaliGemma returned "yes".

In this guide, we are going to walk through how to use PaliGemma for document understanding tasks. We will use Roboflow Inference to deploy the model.

Without further ado, let’s get started!

Step #1: Install Inference

Roboflow Inference is a high-performance vision inference server. You can use Roboflow Inference to run state-of-the-art computer vision models like PaliGemma or YOLOv10.

You can run Roboflow Inference through the Inference Python package or as a microservice deployed with Docker. For this guide, we will show how to deploy Inference with the Python package.

To install the Inference Python package, run the following command:

pip install git+https://github.com/roboflow/inference --upgrade -q

We also need to install a few additional dependencies that the PaliGemma model will use:

pip install "transformers>=4.41.1" accelerate onnx peft timm flash_attn einops -q
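
If you want to confirm that the installed version of transformers meets the 4.41.1 minimum before loading the model, a small check like the one below may help. This is a minimal sketch: the helper names are our own, and the version parsing is deliberately simple (it reads only the leading numeric parts and ignores suffixes such as "rc1").

```python
from importlib import metadata


def parse_version(version):
    """Turn '4.41.1' into (4, 41, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

For example, `meets_minimum(installed_version("transformers"), "4.41.1")` should be true after the install command above succeeds.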

With Inference installed, we can start building logic to use PaliGemma for document understanding.

Step #2: Ask Questions About a Document

We are now ready to build a document understanding system with PaliGemma.

Create a new Python file and add the following code:

import os
from inference import get_model
from PIL import Image
import json

# Load the PaliGemma weights fine-tuned for document question answering.
lora_model = get_model("paligemma-3b-ft-docvqa-448", api_key="KEY")

In this file, we import Inference and load the paligemma-3b-ft-docvqa-448 model weights. These model weights were fine-tuned on document understanding data, which should allow for better performance on document understanding tasks when compared to more general models.

Above, replace KEY with your Roboflow API key. Learn how to retrieve your Roboflow API key.
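
If you would rather not hard-code the key in your script, one option is to read it from an environment variable. The sketch below assumes a variable named ROBOFLOW_API_KEY; that name and the load_api_key helper are our own choices for illustration.

```python
import os


def load_api_key(env_var="ROBOFLOW_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    # The variable name is an assumption; use whatever name suits your setup.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Roboflow API key."
        )
    return key
```

You could then pass `api_key=load_api_key()` to `get_model` instead of a literal string.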

To run the model, use the following code:

image = Image.open("invoice.png")
response = lora_model.infer(image, prompt="who sent this invoice?")
print(response)

Note: This model is large and will take several minutes to download.

Let’s run the script above with the prompt “Who sent this invoice?” and the following image of a printed invoice:

The model returns:

wework community workspace uk limited

PaliGemma returned the correct answer to our question.

PaliGemma performs best when asked questions about specific parts of a document in separate prompts. Let's ask a few questions.

Question: What is the address?

Answer: 10 York Road

Question: What is the invoice city?

Answer: London

Question: What is the invoice post code?

Answer: SE1 9JR

Note: The model got the last three characters of the post code wrong.

Question: What is the invoice date?

Answer: 26 june 2024
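
Asking one question per prompt, as above, can be wrapped in a small loop. The sketch below assumes only that the model object exposes the `infer(image, prompt=...)` call used earlier in this guide. The ask_questions helper and the CannedModel stand-in are hypothetical; the stand-in lets the example run without downloading the weights, and its replies are copied from the answers above.

```python
def ask_questions(model, image, questions):
    """Send each question as its own prompt and collect the raw responses."""
    answers = {}
    for question in questions:
        answers[question] = model.infer(image, prompt=question)
    return answers


class CannedModel:
    """Stand-in for the real model: returns fixed replies for known prompts."""

    def __init__(self, replies):
        self.replies = replies

    def infer(self, image, prompt):
        return self.replies.get(prompt, "")


# Replies copied from the answers reported in this guide.
demo_model = CannedModel(
    {
        "What is the address?": "10 York Road",
        "What is the invoice city?": "London",
        "What is the invoice date?": "26 june 2024",
    }
)

demo_answers = ask_questions(
    demo_model,
    image=None,  # the stand-in ignores the image; pass a PIL image to the real model
    questions=[
        "What is the address?",
        "What is the invoice city?",
        "What is the invoice date?",
    ],
)
```

With the real model, pass `lora_model` and the opened image in place of the stand-in. The responses are stored exactly as the model returns them, so you may need to unwrap them depending on what `infer` gives back.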

Now let's ask about the invoice total:

When asked "what is the invoice cost after tax?", the model returns "54". When asked "what is the invoice cost before tax?", the model returns "45.00". In both cases, the model is correct.
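
The two figures can also be checked against each other. The answers "45.00" and "54" differ by a factor of 1.2, which is what a 20% tax rate would produce. The rate is not stated in this guide, so treat the default below as an assumption. The helper name is our own.

```python
from decimal import Decimal


def totals_are_consistent(before_tax, after_tax, tax_rate="0.20", tolerance="0.01"):
    """Check that before_tax * (1 + tax_rate) matches after_tax within a tolerance."""
    # tax_rate defaults to 20%, an assumption inferred from the two answers.
    expected = Decimal(before_tax) * (Decimal("1") + Decimal(tax_rate))
    return abs(expected - Decimal(after_tax)) <= Decimal(tolerance)


# The model's answers from this guide, passed as strings to avoid float rounding.
check = totals_are_consistent("45.00", "54")
```

A check like this can flag cases where the model misreads one of the two amounts, although it cannot tell you which one is wrong.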

We encourage you to test PaliGemma on different document types from your own data. As the post code example shows, the model can return plausible but incorrect answers, so this testing is essential to understand whether it performs well enough for your use case.
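
One simple way to run such a test is to compare the model's answers with answers you have verified by hand. The sketch below uses exact-match accuracy after lowercasing and collapsing whitespace, since the model tends to return lowercase text. This is our own illustrative metric, not an official benchmark score, and the function names are hypothetical.

```python
def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(str(text).lower().split())


def exact_match_accuracy(predictions, references):
    """Return the fraction of reference questions answered with an exact match.

    Both arguments map question -> answer. Questions missing from
    predictions count as wrong.
    """
    if not references:
        raise ValueError("references must not be empty")
    correct = sum(
        1
        for question, expected in references.items()
        if normalize(predictions.get(question, "")) == normalize(expected)
    )
    return correct / len(references)
```

Exact match is strict: an answer that differs by a single character, as with the post code above, counts as wrong. That strictness is usually what you want for fields like amounts and dates.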

Conclusion

PaliGemma is a multimodal vision architecture developed by Google. PaliGemma was released with several pre-trained models that are fine-tuned on specific tasks.

In this guide, we demonstrated how to use the PaliGemma document understanding model to ask questions about the contents of an image. We installed Inference, then used it to load the model and ask questions about the contents of a document.

To learn more about running PaliGemma models with Inference, refer to the PaliGemma Inference documentation. To learn more about the PaliGemma model architecture and what the series of models can do, refer to our introductory guide to PaliGemma.

If you need assistance integrating PaliGemma into an enterprise application, contact the Roboflow sales team. Our sales and field engineering teams have extensive experience advising on the integration of vision and multimodal models into business applications.

Cite this Post

Use the following entry to cite this post in your research:

James Gallagher. (Jul 12, 2024). Document Understanding with Multimodal Models. Roboflow Blog: https://blog.roboflow.com/multimodal-document-understanding/

Written by

James Gallagher

James is a technical writer at Roboflow, with experience writing documentation on how to train and use state-of-the-art computer vision models.
