PaliGemma is a multimodal vision model architecture developed by Google Research. It was released in May 2024 with support for fine-tuning and a series of existing weights trained on various benchmark datasets.

You can use PaliGemma to detect objects, segment objects, and answer questions about images (visual question answering, or VQA).

Here is an example of the results from a fine-tuned model that can detect shipping containers and container information:

Note: This model was only trained for 512 steps. Training the model for more steps will increase accuracy.

We are excited to announce that you can now deploy PaliGemma models on your own hardware with Roboflow. This guide will walk through how to use the pre-existing PaliGemma weights, and how to upload your own weights for deployment.

Without further ado, let’s get started!

Deploy Fine-Tuned PaliGemma Weights

You can deploy PaliGemma models you have trained on your own data with Roboflow. This feature is designed specifically for object detection.

To deploy your model, you need to:

  1. Create a new project in Roboflow;
  2. Upload training data to Roboflow;
  3. Create a dataset version and download data;
  4. Train a fine-tuned PaliGemma model;
  5. Upload model weights; and
  6. Deploy the model with Inference.

Step #1: Create a Roboflow Project

First, create a Roboflow account. Then, navigate to your account dashboard. Click the “Create Project” button to create a project, then fill out the on-page form.

Choose a name for your project, and select the “Object Detection” data type.

Step #2: Upload Project Data

Once you have created your project, upload the data you used to train your model, or that you plan to use to train your model. You can upload either raw images or data annotated in any of the supported object detection upload formats.

Drag and drop your data into the web interface. When the web application has processed the images (and annotations, if you have uploaded them), click “Save and Continue” to upload your data.

If you have any data that is not annotated, you can annotate it with Roboflow. Roboflow has a suite of features for data labeling.

You can use Auto Label to automatically label images without labels. Auto Label works best if you want to identify common objects.

You can also use the Roboflow web interface to draw bounding boxes. You can use the SAM-powered annotation helper to speed up manual annotation, too. With this feature, you can click on an object to draw a polygon around it. 

Annotating using Smart Polygon.

This polygon will then be converted to a bounding box before training. This allows you to annotate an object in one click, instead of having to precisely draw a bounding box around the desired region.
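The polygon-to-box conversion can be pictured as taking the smallest axis-aligned rectangle that contains every polygon vertex. The snippet below is a minimal sketch of that idea, not Roboflow's actual implementation; the function name `polygon_to_bbox` and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_to_bbox(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all polygon vertices."""
    if not points:
        raise ValueError("polygon needs at least one point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical polygon produced by a single click on an object
polygon = [(120.0, 80.0), (300.0, 95.0), (310.0, 220.0), (115.0, 210.0)]
print(polygon_to_bbox(polygon))
```

Because only the extreme x and y values matter, a rough polygon still yields a tight box as long as it covers the object's outline.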

Step #3: Create Dataset Version and Download Data

Once you have labeled your data, click “Generate” in the left sidebar to generate a dataset version. A dataset version is a snapshot of your data that is frozen in time.

You can apply preprocessing and augmentation steps to dataset versions. Learn more about how these steps can be useful in our preprocessing and augmentation guide.

Once you have added any preprocessing and augmentation steps, scroll to the bottom of the page and click “Create” to create your dataset.

A version of your dataset will then be generated. The amount of time this process takes will depend on the number of images in your dataset and the augmentations you have applied.

When your dataset has been generated, you can export it for use in training a PaliGemma object detection model.

Step #4: Fine-tune a PaliGemma Model

If you have already trained a model, you can skip this step.

When you upload a model to Roboflow, Roboflow acts as a model registry: you can store your custom weights, then deploy them on your own hardware with Roboflow Inference.

To deploy your fine-tuned weights, you first need to have a fine-tuned PaliGemma model.

We have made an interactive notebook that shows how to fine-tune PaliGemma on a custom dataset. With this notebook, you can provide your own data in the PaliGemma format and use it to train a model that can be uploaded to Roboflow.

To train a PaliGemma model, you will need data in the PaliGemma JSONL format. You can export any object detection dataset in this format from Roboflow, then use the data to train a model.
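To give a sense of what a detection record in this format can look like, here is a rough sketch that builds one JSONL line by hand. It assumes each record has `image`, `prefix`, and `suffix` keys, that boxes are written as four location tokens in `y_min, x_min, y_max, x_max` order, and that coordinates are scaled into 1024 bins (`<loc0000>` to `<loc1023>`). The helper names and the exact rounding are illustrative assumptions; the Roboflow export handles this for you and may differ in detail.

```python
import json


def to_loc_token(value: float, size: float, bins: int = 1024) -> str:
    """Scale a pixel coordinate into a <locNNNN> token (assumed 1024 bins)."""
    index = int(value / size * bins)
    index = max(0, min(bins - 1, index))  # clamp to the valid token range
    return f"<loc{index:04d}>"


def detection_suffix(boxes, width, height):
    """Encode (x_min, y_min, x_max, y_max, label) boxes as a suffix string."""
    parts = []
    for x_min, y_min, x_max, y_max, label in boxes:
        # Assumed token order: y_min, x_min, y_max, x_max
        tokens = (
            to_loc_token(y_min, height)
            + to_loc_token(x_min, width)
            + to_loc_token(y_max, height)
            + to_loc_token(x_max, width)
        )
        parts.append(f"{tokens} {label}")
    return " ; ".join(parts)


def jsonl_record(image_name, boxes, width, height, classes):
    """Build one JSONL line; the key names are an assumption about the format."""
    return json.dumps({
        "image": image_name,
        "prefix": "detect " + " ; ".join(classes),
        "suffix": detection_suffix(boxes, width, height),
    })


# Hypothetical 1000x500 image with a single labeled container
record = jsonl_record(
    "yard.jpg", [(100, 50, 400, 300, "container")], 1000, 500, ["container"]
)
print(record)
```

Writing boxes as text tokens is what lets a language model produce detections: the box becomes part of the string the model learns to generate.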

Click "Export Dataset" on your project page to get the code you need to download your dataset in the requisite format for use in our notebook:

With your dataset and our notebook, you can train your PaliGemma model.

Step #5: Upload Model Weights

Once you have trained your model, you can upload it to Roboflow using the following code:

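from roboflow import Roboflow

# Authenticate and select the project version to attach the weights to
rf = Roboflow(api_key="API_KEY")
project = rf.workspace("workspace-id").project("project-id")
version = project.version(VERSION)

# Upload the fine-tuned weights from the directory where they were saved
version.deploy(model_type="paligemma-3b-pt-224", model_path="/content/paligemma-lora")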

Above, replace:

  • API_KEY with your Roboflow API key.
  • workspace-id and project-id with your workspace and project IDs.
  • VERSION with your project version.

If you are not using our notebook, replace /content/paligemma-lora with the directory where you saved your model weights.

When you run the code above, the model will be uploaded to Roboflow. It will take a few minutes for the model to be processed before it is ready for use.

Step #6: Deploy a Fine-tuned PaliGemma Model

When your model is ready, you can download it from Roboflow onto any device where you want to run it. To do so, you can use Roboflow Inference, our open source computer vision inference server.

First, install inference:

pip install inference


Then, create a new Python file and add the following code:

from inference import get_model
from PIL import Image

# Load the fine-tuned model from Roboflow
lora_model = get_model("model-id/version-id", api_key="KEY")

# Run inference on a local image
image = Image.open("yard.jpg")
response = lora_model.infer(image)
print(response)

Above, replace:

  • model-id with your Roboflow model ID;
  • version-id with your project version; and
  • KEY with your Roboflow API key.

Our model returns:

Our model successfully returns predictions.
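PaliGemma expresses detections as text containing location tokens. If you end up with a raw string of that kind, the sketch below shows one way to turn it back into pixel boxes. Whether the Inference response exposes the text in exactly this form is not confirmed here. The `parse_detections` helper, the sample string, and the assumed `y_min, x_min, y_max, x_max` order with 1024 bins are all illustrative assumptions.

```python
import re

# Four consecutive <locNNNN> tokens followed by a label (assumed layout)
LOC_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text, width, height, bins=1024):
    """Convert '<loc...> label ; ...' text into pixel-space boxes."""
    detections = []
    for tokens, label in LOC_PATTERN.findall(text):
        # Assumed token order: y_min, x_min, y_max, x_max
        y_min, x_min, y_max, x_max = (
            int(v) for v in re.findall(r"<loc(\d{4})>", tokens)
        )
        detections.append({
            "label": label.strip(),
            "box": (
                x_min / bins * width,
                y_min / bins * height,
                x_max / bins * width,
                y_max / bins * height,
            ),  # (x_min, y_min, x_max, y_max) in pixels
        })
    return detections


# Hypothetical raw output for a 1024x512 image
raw = (
    "<loc0102><loc0102><loc0614><loc0409> container ; "
    "<loc0000><loc0512><loc0256><loc1023> container-id"
)
for detection in parse_detections(raw, width=1024, height=512):
    print(detection)
```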

Conclusion

PaliGemma is a multimodal vision model architecture developed by Google. You can use PaliGemma for a range of tasks, from visual question answering to object detection.

In this guide, we demonstrated how you can use some of the fine-tuned PaliGemma models released by Google with Inference. We also walked through how you can upload fine-tuned object detection models to Roboflow for deployment with Inference.

To learn more about deploying vision models with Roboflow, refer to the Inference documentation.