Understand Website Screenshots with a Multimodal Vision Model

Multimodal vision models let you provide an image and retrieve information about it in several forms. Florence-2, a state-of-the-art multimodal model architecture, can generate rich descriptions of images.

Using multimodal models, you can compute information that can be used to build AI agents to navigate web pages, or better index website information for search applications.

In this guide, we are going to walk through how to generate text descriptions of website screenshots using Florence-2. We will show you how to run the model on your own hardware with HuggingFace Transformers.

Consider the following image:

Here is an example description of the image that was generated by Florence-2:

The image is a screenshot of the homepage of a website called Roboflow. The website has a purple and blue color scheme with a white background. On the left side of the page, there is a title that reads "Everything you need to build and deploy computer vision models". Below the title, there are two buttons - "Product", "Solutions", "Resources", "Pricing", "Docs", and "Talk to Sales".

On the right side of this page, we can see a construction site with workers wearing high visibility vests and hard hats. There are orange cones and construction equipment scattered around the site. The site appears to be under construction, as there are buildings and scaffolding visible in the background.

At the bottom right corner of the image, there has a button that says "Get Started" and a link to the website's website.

We will also show how to run OCR to retrieve text on particular parts of a web page.

Without further ado, let’s get started!

Step #1: Install Dependencies

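In this guide, we will use HuggingFace Transformers and the timm image models library to load Florence-2. The install command below also includes flash_attn and einops, which the model's code expects to be available. To get started, we need to install the required dependencies: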

pip install transformers timm flash_attn einops

Once you have installed the required dependencies, you can start generating image captions.
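
Before moving on, it can help to confirm that the packages are importable. The following is a minimal sketch that uses only the Python standard library. The helper name `check_dependencies` is our own, and the module names assume each package's usual import name (for example, `PIL` for Pillow, which the code in the next step imports).

```python
from importlib.util import find_spec

# Import names assumed for the packages used in this guide.
REQUIRED_MODULES = ["transformers", "timm", "flash_attn", "einops", "PIL"]

def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in modules}

# Print a short report so missing packages are easy to spot.
for name, found in check_dependencies().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If any package is reported as missing, re-run the pip command above in the same Python environment.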

Step #2: Generate a Website Screenshot Description

With the dependencies installed, we can now start to generate descriptions of website screenshots. For this, we are going to use the image captioning capabilities of Florence-2.

Create a new Python file and add the following code:

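from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

# Load the large Florence-2 checkpoint and its processor.
# trust_remote_code=True lets Transformers run the model code hosted alongside the weights.
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_example(task_prompt, text_input=None):
    # Note: this function reads the global `image` variable, which must be set before calling it.
    # Some tasks accept extra text after the task token; captioning uses the task token alone.
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input

    # Convert the prompt and image into model inputs, then generate with beam search on the GPU.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"].cuda(),
        pixel_values=inputs["pixel_values"].cuda(),
        max_new_tokens=1024,
        early_stopping=False,
        do_sample=False,
        num_beams=3,
    )

    # Keep special tokens when decoding, since the post-processing step parses them.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )

    return parsed_answer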

In this code, we import Transformers and load the large Florence-2 model. We then define a function that runs Florence-2 on an image and returns the processed answer.

To run the model on an image, we can use the following code:

image = Image.open("image.jpeg").convert("RGB")

task_prompt = "<MORE_DETAILED_CAPTION>"
answer = run_example(task_prompt=task_prompt)

print(answer)
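
The printed answer is a dictionary rather than a plain string. In the Florence-2 examples we have seen, the caption is stored with the task prompt as its key. The sketch below assumes that shape and uses a stand-in dictionary in place of real model output; the helper name `extract_caption` is hypothetical.

```python
def extract_caption(answer, task_prompt):
    """Pull the caption string out of a parsed Florence-2 answer.

    Assumes `answer` looks like {task_prompt: "caption text"}.
    """
    caption = answer.get(task_prompt)
    if not isinstance(caption, str):
        raise ValueError(f"No text caption found for task {task_prompt!r}")
    return caption.strip()

# Stand-in for the value returned by run_example(); not real model output.
example_answer = {"<MORE_DETAILED_CAPTION>": " A screenshot of a website homepage. "}
print(extract_caption(example_answer, "<MORE_DETAILED_CAPTION>"))
```

Extracting the string this way makes it easier to store captions in a file or database without the surrounding dictionary syntax.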

When you first run the script, the Florence-2 weights will be downloaded onto your system. The download may take a few minutes, depending on the speed of your internet connection. The weights will be cached on your device so that they can be loaded, rather than downloaded, on subsequent runs.
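
If you want to check whether the weights are already cached, the sketch below computes where Hugging Face libraries typically store them. It assumes the default cache layout (`~/.cache/huggingface/hub`, overridable with the `HF_HOME` or `HF_HUB_CACHE` environment variables) and the `models--<org>--<name>` folder naming. These details may differ across library versions, and the function name is our own.

```python
import os
from pathlib import Path

def expected_cache_dir(model_id, env=None):
    """Return the folder where the hub cache would usually hold `model_id`."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        hub_cache = Path(env["HF_HUB_CACHE"])
    else:
        # Assumed default location when HF_HOME is not set.
        hf_home = env.get("HF_HOME") or os.path.expanduser("~/.cache/huggingface")
        hub_cache = Path(hf_home) / "hub"
    # Slashes in the model ID are replaced with "--" in the folder name.
    return hub_cache / ("models--" + model_id.replace("/", "--"))

cache_dir = expected_cache_dir("microsoft/Florence-2-large")
print(cache_dir, "(found)" if cache_dir.exists() else "(not downloaded yet)")
```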

Let’s run our code.

When given the following image:

Florence-2 returns the following description:

The image is a screenshot of the homepage of a website called Roboflow. The website has a purple and white color scheme with a navigation bar at the top. Below the navigation bar, there is a navigation menu with options such as Product, Solutions, Resources, Pricing, Docs, and Sign Up.

On the left side of the page, there are two tabs - "Fine-tune Florence-2 for Object Detection with Custom Data" and "How to Train YOLOV10 Model on a Custom Dataset". On the right side, the page has a title that reads "Roboflow" and a brief description of the website's features.

The main content of the webpage is divided into two sections. The first section is titled "Fine Tune Florence 2" and has an image of a computer screen with a graph and a line graph on it. The second section has a description of how to train YOLovi10 model on a custom dataset. The text below the title explains that the website offers a tutorial on how to improve the performance of an object detection system with custom data.

The above description captures the contents of the image. With that said, the model makes mistakes related to the positioning of elements. It says that there are two tabs on the left side of the page, when in fact the two tabs span the full width of the top of the viewport. It also says that there is a description of the website's features, which is not accurate.

The above example illustrates both the strengths and weaknesses of the model. The caption identifies a large amount of information: the website name, the colour scheme, the navigation bar, the contents of the page and how they relate. But the model struggles with the position and direction of elements.
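
Where position matters, Florence-2's OCR tasks may be a better fit than free-form captions, since they can return text together with its location. The sketch below assumes the `<OCR_WITH_REGION>` task returns a dictionary with `quad_boxes` (eight numbers per box: four x, y corner pairs in pixels) and `labels`. That matches the examples published with the model, but you should check it against your own output. The helper `text_in_region` and the sample data are hypothetical.

```python
def text_in_region(ocr_result, region):
    """Return OCR labels whose box centre falls inside `region`.

    `region` is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = region
    matches = []
    for quad, label in zip(ocr_result["quad_boxes"], ocr_result["labels"]):
        xs, ys = quad[0::2], quad[1::2]  # split x and y coordinates
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
        if x_min <= cx <= x_max and y_min <= cy <= y_max:
            matches.append(label)
    return matches

# Stand-in for run_example("<OCR_WITH_REGION>")["<OCR_WITH_REGION>"]; made-up values.
sample = {
    "quad_boxes": [
        [10, 10, 110, 10, 110, 40, 10, 40],      # near the top of the page
        [10, 500, 210, 500, 210, 540, 10, 540],  # further down the page
    ],
    "labels": ["Product", "Get Started"],
}
print(text_in_region(sample, (0, 0, 1280, 100)))
```

Here the region covers the top 100 pixels of a 1280-pixel-wide screenshot, so only text whose box centre sits in that strip is returned.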

This model is best used for applications that don't require understanding of the spatial relations between elements on a page. For example, it could be used to generate a description of website screenshots for use in building a local screenshot information retrieval system.
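
As a rough illustration of such a retrieval system, the sketch below ranks screenshots by how many query words appear in their captions. It uses only the standard library. The file names and captions are placeholders rather than real model output, and a production system would likely use a proper search index or embeddings instead of simple word overlap.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(captions, query):
    """Return file names ranked by the number of query words in each caption."""
    query_words = tokenize(query)
    scored = []
    for filename, caption in captions.items():
        score = len(query_words & tokenize(caption))
        if score > 0:
            scored.append((score, filename))
    # Highest score first; ties are broken alphabetically by file name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [filename for _, filename in scored]

# Placeholder captions; in practice these would come from run_example().
captions = {
    "homepage.png": "A screenshot of a homepage with a navigation bar and a Get Started button.",
    "blog.png": "A screenshot of a blog with tutorials about object detection models.",
}
print(search(captions, "object detection tutorials"))
```

Because this approach only matches words, it plays to the caption's strengths (what is on the page) and does not depend on the model getting positions right.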

Conclusion

With Florence-2, you can generate descriptions of website screenshots. Such descriptions may be useful for building information retrieval applications. For example, you could build a system that lets you search over screenshots on your desktop using captions generated by the system.

In this guide, we walked through how to generate descriptions of website screenshots with Florence-2. We installed HuggingFace Transformers, downloaded and initialized a Florence-2 model, then ran the model on an example image.

To learn more about the Florence-2 model architecture and what the model can do, refer to our introductory guide to Florence-2.

If you need assistance integrating Florence-2 into an enterprise application, contact the Roboflow sales team. Our sales and field engineering teams have extensive experience advising on the integration of vision and multimodal models into business applications.