26 Aug 2025 • 9 min read How to Fine-Tune Qwen2.5-VL with a Custom Dataset Learn how to fine-tune Qwen2.5-VL for document processing using a custom dataset.
7 Aug 2025 • 4 min read GPT-5 for Vision: Results from 80+ Real-World Tests On August 7th, 2025, OpenAI released GPT-5, the newest model in their GPT series. GPT-5 has advanced reasoning capabilities and, like many recent models by OpenAI, multimodal support. This means that you can not only prompt GPT-5 with one or more images and ask for an answer, but also prompt the
18 Jul 2025 • 5 min read Use Qwen2.5-VL for Zero-Shot Object Detection Qwen2.5-VL is the latest addition to the Qwen vision-language model series, offering cutting-edge capabilities for image, text, and document understanding. Available in three model sizes—3B, 7B, and 72B—it excels at tasks such as object detection, OCR for multi-language and rotated text, and structured data extraction from complex
23 Jun 2025 • 7 min read How to Fine-Tune a SmolVLM2 Model on a Custom Dataset Learn how to fine-tune SmolVLM2 for visual question answering using a custom dataset.
11 Jun 2025 • 4 min read OpenAI o3-pro: Multimodal and Vision Analysis Explore how OpenAI o3-pro performs on a range of use cases, from defect detection to object counting to VQA.
17 Apr 2025 • 6 min read OpenAI o3 and o4-mini: Multimodal and Vision Analysis Read our analysis of how OpenAI's o3 and o4-mini models perform on a range of vision tasks.
15 Apr 2025 • 5 min read OpenAI GPT-4.1: Multimodal and Vision Analysis Read our analysis of OpenAI's GPT-4.1 model on multimodal tasks like VQA, object detection, and more.
13 Mar 2025 • 5 min read Gemma 3: Multimodal and Vision Analysis Read our analysis of Google's Gemma 3 model and how it performs on a variety of common multimodal vision AI tasks.
13 Mar 2025 • 1 min read Foundational Few-Shot Object Detection Challenge [CVPR 2025] Roboflow & Carnegie Mellon University are releasing the second iteration of the Foundational Few-Shot Object Detection Challenge at CVPR 2025.
11 Mar 2025 • 5 min read SmolVLM2: Multimodal and Vision Analysis Read our analysis of how SmolVLM2 performs on a range of multimodal vision tasks.
11 Mar 2025 • 6 min read Moondream 2: Multimodal and Vision Analysis Read our analysis of how the multimodal Moondream 2 model performs on a range of vision tasks.
13 Feb 2025 • 7 min read OpenAI o3-mini: Vision and Multimodal Features Read our analysis of how OpenAI's o3-mini model performs on various computer vision tasks.
2 Jan 2025 • 8 min read What is Few-Shot Learning? In this blog post, we discuss what few-shot learning is, architectural approaches for implementing it, and specific implementations of few-shot techniques.
10 Dec 2024 • 13 min read How to Fine-tune PaliGemma 2 Learn how to fine-tune PaliGemma 2 to extract data from an image in JSON format.
16 Oct 2024 • 6 min read Launch: Use Florence-2 in Roboflow Workflows Learn how to use Florence-2 in Roboflow Workflows for zero-shot object detection, OCR, and more.
3 Sep 2024 • 7 min read Table and Figure Understanding with Computer Vision Learn how to use Roboflow Workflows and multimodal models to derive information about the contents of tables and figures.
1 Sep 2024 • 15 min read What is CLIP? Contrastive Language-Image Pre-Training Explained. CLIP is an open source, multimodal computer vision model developed by OpenAI. Learn what CLIP is in this guide.
22 Jul 2024 • 7 min read How to OCR Hand-Written Notes with GPT-4 Learn how to OCR hand-written notes with GPT-4.
12 Jul 2024 • 3 min read Document Understanding with Multimodal Models Learn how to use the PaliGemma multimodal model to ask questions about the contents of a document.
12 Jul 2024 • 3 min read Visual Question Answering with Multimodal Models Learn how to use the PaliGemma multimodal model to ask questions about images.
12 Jul 2024 • 4 min read Understand Website Screenshots with a Multimodal Vision Model Learn how to use the Florence-2 multimodal model to generate rich descriptions of website screenshots.
12 Jul 2024 • 4 min read How to Caption Images with a Multimodal Vision Model Learn how to caption images using a multimodal vision model.
10 Jul 2024 • 5 min read How to Use Florence-2 for Optical Character Recognition Learn how to use the Florence-2 model for Optical Character Recognition tasks.
10 Jul 2024 • 4 min read What is Dense Image Captioning? Learn what dense image captioning is and how to use the MIT-licensed Florence-2 model to generate dense image captions.