Multimodal

Which is the Best Coding Agent for Vision tasks?

16 Mar 2026 • 5 min read

Coding agents are quickly becoming a popular way to build applications because they generate code, run it, debug errors, and iterate autonomously. But how well do they perform on visual understanding tasks and vision applications? We evaluated the top four coding agents (Claude Code, Gemini CLI, OpenAI Codex, Cursor)

Best Multimodal Models in 2026

4 Feb 2026 • 7 min read

From SAM 3’s record-breaking segmentation speed to Gemini 3’s massive 2-million-token context window, explore the top models that can "see," reason, and deploy in production today.

Launch: Use Segment Anything 3 (SAM 3) with Roboflow

19 Nov 2025 • 5 min read

Today, we are introducing new tools powered by Segment Anything 3 (SAM 3) that significantly change how people build computer vision applications. SAM 3 is a powerful vision foundation model that detects, segments, and tracks objects in images and videos based on prompts. See our detailed SAM 3 model overview

How to Detect, Track, and Identify Basketball Players with Computer Vision

30 Sep 2025 • 9 min read

Build a computer vision pipeline that detects, tracks, and identifies NBA players during a game. Using models like RF-DETR, SAM2, SigLIP, SmolVLM2, and ResNet, the system handles motion blur, occlusion, and uniform similarity to reliably map jersey numbers to player identities.

How to Fine-Tune Qwen2.5-VL with a Custom Dataset

26 Aug 2025 • 9 min read

Learn how to fine-tune Qwen2.5-VL for document processing using a custom dataset.

GPT-5 for Vision: Results from 80+ Real-World Tests

7 Aug 2025 • 4 min read

On August 7, 2025, OpenAI released GPT-5, the newest model in their GPT series. GPT-5 has advanced reasoning capabilities and, like many recent OpenAI models, multimodal support. This means that you can not only prompt GPT-5 with one or more images and ask for an answer, but also prompt the

Use Qwen2.5-VL for Zero-Shot Object Detection

18 Jul 2025 • 5 min read

Qwen2.5-VL is the latest addition to the Qwen vision-language model series, offering cutting-edge capabilities for image, text, and document understanding. Available in three model sizes—3B, 7B, and 72B—it excels at tasks such as object detection, OCR for multi-language and rotated text, and structured data extraction from complex

How to Fine-Tune a SmolVLM2 Model on a Custom Dataset

23 Jun 2025 • 7 min read

Learn how to fine-tune SmolVLM2 for visual question answering using a custom dataset.

OpenAI o3-pro: Multimodal and Vision Analysis

11 Jun 2025 • 4 min read

Explore how OpenAI o3-pro performs across a range of use cases, from defect detection to object counting to VQA.

OpenAI o3 and o4-mini: Multimodal and Vision Analysis

17 Apr 2025 • 6 min read

Read our analysis of how OpenAI's o3 and o4-mini models perform on a range of vision tasks.

OpenAI GPT-4.1: Multimodal and Vision Analysis

15 Apr 2025 • 5 min read

Read our analysis of OpenAI's GPT-4.1 model on multimodal tasks like VQA, object detection, and more.

Gemma 3: Multimodal and Vision Analysis

13 Mar 2025 • 5 min read

Read our analysis of Google's Gemma 3 model and how it performs on a variety of common multimodal vision AI tasks.

Foundational Few-Shot Object Detection Challenge [CVPR 2025]

13 Mar 2025 • 1 min read

Roboflow and Carnegie Mellon University are releasing the second iteration of the Foundational Few-Shot Object Detection Challenge at CVPR 2025.

SmolVLM2: Multimodal and Vision Analysis

11 Mar 2025 • 5 min read

Read our analysis of how SmolVLM2 performs on a range of multimodal vision tasks.

Moondream 2: Multimodal and Vision Analysis

11 Mar 2025 • 6 min read

Read our analysis of how the multimodal Moondream 2 model performs on a range of vision tasks.

OpenAI o3-mini: Vision and Multimodal Features

13 Feb 2025 • 7 min read

Read our analysis of how OpenAI's o3-mini model performs on various computer vision tasks.

An introduction and approaches to few-shot learning with Roboflow

2 Jan 2025 • 8 min read

What is Few-Shot Learning?

In this blog post, we discuss what few-shot learning is, architectural approaches for implementing it, and specific implementations of few-shot techniques.

How to Fine-tune PaliGemma 2

10 Dec 2024 • 13 min read

Learn how to fine-tune PaliGemma 2 to extract data from an image in JSON format.

Launch: Use Florence-2 in Roboflow Workflows

16 Oct 2024 • 6 min read

Learn how to use Florence-2 in Roboflow Workflows for zero-shot object detection, OCR, and more.

Table and Figure Understanding with Computer Vision

3 Sep 2024 • 7 min read

Learn how to use Roboflow Workflows and multimodal models to derive information about the contents of tables and figures.

CLIP: Connecting text and images

1 Sep 2024 • 15 min read

What is CLIP? Contrastive Language-Image Pre-Training Explained.

CLIP is an open source, multimodal computer vision model developed by OpenAI. Learn what CLIP is in this guide.

How to OCR Hand-Written Notes with GPT-4

22 Jul 2024 • 7 min read

Learn how to OCR handwritten notes with GPT-4.

Document Understanding with Multimodal Models

12 Jul 2024 • 3 min read

Learn how to use the PaliGemma multimodal model to ask questions about the contents of a document.

Visual Question Answering with Multimodal Models

12 Jul 2024 • 3 min read

Learn how to use the PaliGemma multimodal model to ask questions about images.
