Multimodal

How to OCR Handwritten Notes with GPT-4

Learn how to use GPT-4's vision capabilities to extract text from handwritten notes.

Document Understanding with Multimodal Models

Learn how to use the PaliGemma multimodal model to ask questions about the contents of a document.

Visual Question Answering with Multimodal Models

Learn how to use the PaliGemma multimodal model to ask questions about images.

Understand Website Screenshots with a Multimodal Vision Model

Learn how to use the Florence-2 multimodal model to generate rich descriptions of website screenshots.

How to Caption Images with a Multimodal Vision Model

Learn how to caption images using a multimodal vision model.

How to Use Florence-2 for Optical Character Recognition

Learn how to use the Florence-2 model for Optical Character Recognition tasks.

What is Dense Image Captioning?

Learn what dense image captioning is and how to use the MIT-licensed Florence-2 model to generate dense image captions.

How to Fine-tune Florence-2 for Object Detection Tasks

This tutorial will show you how to fine-tune Florence-2 on object detection datasets to improve model performance for your specific use case.

Florence-2: Open Source Vision Foundation Model by Microsoft

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license.

How to Fine-tune PaliGemma for Object Detection Tasks

Learn how to fine-tune the PaliGemma multimodal model to detect custom objects.

Fine-tuning Moondream2 for Computer Vision Tasks

In this guide, we fine-tune Moondream2, a small, fast vision language model that runs locally, to improve its performance on a computer vision task.

PaliGemma: An Open Multimodal Model by Google

PaliGemma is a multimodal vision language model (VLM) developed and released by Google. Learn how to use it.

GPT-4o: The Comprehensive Guide and Explanation

Learn what GPT-4o is, how it differs from previous models, how it performs, and what use cases it supports.

Ultimate Guide to Using CLIP with Intel Gaudi2

Learn how to use CLIP on the Intel Gaudi2 chip, including how to train and deploy a custom CLIP model.

Launch: YOLO-World Support in Roboflow

Learn how you can use YOLO-World with Roboflow.

Best OCR Models for Text Recognition in Images

See how nine OCR models compare for scene text recognition across industrial domains.

What is Visual Question Answering (VQA)?

Learn what Visual Question Answering (VQA) is, how it works, and explore models commonly used for VQA.

First Impressions with the Claude 3 Opus Vision API

The Roboflow team ran several computer vision tests using the Claude 3 Opus Vision API. Read our results.

Multimodal Video Analysis with CLIP using Intel Gaudi2 HPUs

Learn how to use CLIP and the Intel Gaudi2 chip to run multimodal analyses and classification on videos.

Build an Image Search Engine with CLIP using Intel Gaudi2 HPUs

Learn how to use the Intel Gaudi2 chip to build an image search engine with CLIP embeddings.

Tips and Tricks for Prompting YOLO-World

Explore six tips on how to effectively use YOLO-World to identify objects in images.

Build Enterprise Datasets with CLIP for Multimodal Model Training Using Intel Gaudi2 HPUs

In this guide, learn how to use CLIP on Intel Gaudi2 HPUs to deduplicate datasets before training large multimodal vision models.

YOLO-World: Real-Time, Zero-Shot Object Detection

YOLO-World is a zero-shot, real-time object detection model.

First Impressions with Gemini Advanced

Read our first impressions of Gemini Advanced, powered by the Gemini Ultra multimodal model, across a range of computer vision tasks.

Launch: GPT-4 Checkup

GPT-4 Checkup is a web utility that monitors the performance of GPT-4 with Vision over time. Learn how to use and contribute to GPT-4 Checkup.