Multimodal

How to Fine-tune PaliGemma for Object Detection Tasks

17 May 2024 • 7 min read

Learn how to fine-tune the PaliGemma multimodal model to detect custom objects.

Finetuning Moondream2 for Computer Vision Tasks

17 May 2024 • 8 min read

In this guide, we finetune and improve Moondream2, a small, local, fast multimodal Vision Language Model, for a computer vision task.

PaliGemma: An Open Multimodal Model by Google

15 May 2024 • 10 min read

PaliGemma is a vision language model (VLM) developed and released by Google that has multimodal capabilities. Learn how to use it.

GPT-4o: The Comprehensive Guide and Explanation

14 May 2024 • 10 min read

Learn what GPT-4o is, how it differs from previous models, how it performs, and what its use cases are.

Ultimate Guide to Using CLIP with Intel Gaudi2

26 Mar 2024 • 9 min read

Learn how to use CLIP on the Intel Gaudi2 chip. This guide discusses training and deploying a custom CLIP model on Gaudi2.

Launch: YOLO-World Support in Roboflow

21 Mar 2024 • 4 min read

Learn how you can use YOLO-World with Roboflow.

Best OCR Models for Text Recognition in Images

16 Mar 2024 • 7 min read

See how nine different OCR models compare for scene text recognition across industrial domains.

What is Visual Question Answering (VQA)?

13 Mar 2024 • 10 min read

Learn what Visual Question Answering (VQA) is, how it works, and explore models commonly used for VQA.

First Impressions with the Claude 3 Opus Vision API

5 Mar 2024 • 6 min read

The Roboflow team ran several computer vision tests using the Claude 3 Opus Vision API. Read our results.

Multimodal Video Analysis with CLIP using Intel Gaudi2 HPUs

3 Mar 2024 • 6 min read

Learn how to use CLIP and the Intel Gaudi2 chip to run multimodal analyses and classification on videos.

Build an Image Search Engine with CLIP using Intel Gaudi2 HPUs

28 Feb 2024 • 9 min read

Learn how to use the Intel Gaudi2 chip to build an image search engine with CLIP embeddings.

Tips and Tricks for Prompting YOLO World

23 Feb 2024 • 6 min read

Explore six tips on how to effectively use YOLO-World to identify objects in images.

Build Enterprise Datasets with CLIP for Multimodal Model Training Using Intel Gaudi2 HPUs

20 Feb 2024 • 8 min read

In this guide, learn how to use CLIP on Intel Gaudi2 HPUs to deduplicate datasets before training large multimodal vision models.

YOLO-World: Real-Time, Zero-Shot Object Detection

13 Feb 2024 • 6 min read

YOLO-World is a zero-shot, real-time object detection model.

First Impressions with Gemini Advanced

8 Feb 2024 • 7 min read

Read our first impressions using the Gemini Ultra multimodal model across a range of computer vision tasks.

Launch: GPT-4 Checkup

5 Jan 2024 • 4 min read

GPT-4 Checkup is a web utility that monitors the performance of GPT-4 with Vision over time. Learn how to use and contribute to GPT-4 Checkup.

NeurIPS 2023 Papers Highlights

21 Dec 2023 • 5 min read

NeurIPS 2023, the conference and workshop on Neural Information Processing Systems, took place December 10th through 16th. The conference showcased the latest in machine learning and artificial intelligence. This year’s conference featured 3,584 papers that advance machine learning across many domains. NeurIPS announced the NeurIPS 2023 award-winning papers

How to Deploy CogVLM on AWS

20 Dec 2023 • 3 min read

Guide on deploying a CogVLM Inference Server with 4-bit quantization on Amazon Web Services, covering setup of EC2 instances, configuring hardware and software requirements, and starting the inference server with Docker.

CogVLM Use Cases in Industry

20 Dec 2023 • 5 min read

Learn how you can use CogVLM, a multimodal language model with vision capabilities, for industrial use cases.

How to Deploy CogVLM

14 Dec 2023 • 5 min read

In this guide, learn how to deploy the CogVLM multimodal model on your own infrastructure with Roboflow Inference.

First Impressions with Google’s Gemini

13 Dec 2023 • 6 min read

In this guide, we evaluate Google's Gemini LMM against several computer vision tasks, from OCR to VQA to zero-shot object detection.

Google's Gemini Multimodal Model: What We Know

7 Dec 2023 • 11 min read

In this guide, we are going to discuss what Gemini is, for whom it is available, and what Gemini can do (according to the information available from Google). We will also look ahead to potential applications for Gemini in computer vision tasks.

Multimodal Maestro: Advanced LMM Prompting

29 Nov 2023 • 3 min read

Learn how to expand the range of LMMs' capabilities using Multimodal Maestro.

Launch: Synthetic Image Generation with DALL-E and GPT-4 Vision

28 Nov 2023 • 5 min read

In this guide, learn how to use Roboflow to generate synthetic data with DALL-E and GPT-4 Vision for use in training vision models.
