Multimodal

YOLO-World: Real-Time, Zero-Shot Object Detection

YOLO-World is a zero-shot, real-time object detection model.
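
For a quick sense of how zero-shot prompting works with this kind of model, here is a minimal sketch using the ultralytics implementation of YOLO-World; the weights file name and class prompts are illustrative assumptions, not taken from the article.

```python
# Minimal sketch: zero-shot detection with YOLO-World via the ultralytics package.
# The weights file name and class prompts are illustrative assumptions.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")       # load pretrained YOLO-World weights
model.set_classes(["person", "forklift"])   # define the vocabulary at inference time
results = model.predict("warehouse.jpg")    # run detection on an image
results[0].show()                           # visualize the predictions
```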

First Impressions with Gemini Advanced

Read our first impressions using the Gemini Ultra multimodal model across a range of computer vision tasks.

Launch: GPT-4 Checkup

GPT-4 Checkup is a web utility that monitors the performance of GPT-4 with Vision over time. Learn how to use and contribute to GPT-4 Checkup.

NeurIPS 2023 Papers Highlights

NeurIPS 2023, the Conference on Neural Information Processing Systems, took place December 10th through 16th and showcased the latest research in machine learning and artificial intelligence. This post highlights notable papers from the conference.

How to Deploy CogVLM on AWS

A guide to deploying a CogVLM Inference Server with 4-bit quantization on Amazon Web Services, covering EC2 instance setup, hardware and software requirements, and starting the inference server with Docker.

CogVLM Use Cases in Industry

Learn how you can use CogVLM, a multimodal language model with vision capabilities, for industrial use cases.

How to Deploy CogVLM

In this guide, learn how to deploy the CogVLM multimodal model on your own infrastructure with Roboflow Inference.
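
As a rough illustration of what querying a self-hosted CogVLM deployment can look like, the sketch below sends a base64-encoded image and a prompt to a local inference server; the port, endpoint path, and payload fields are assumptions, so refer to the guide for the exact Roboflow Inference API.

```python
# Rough sketch: send an image and a prompt to a locally hosted CogVLM server.
# The port (9001), endpoint path, and payload fields are assumptions; consult
# the deployment guide for the exact Roboflow Inference API.
import base64
import requests

with open("factory_floor.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "image": {"type": "base64", "value": image_b64},
    "prompt": "How many pallets are visible in this image?",
}

response = requests.post("http://localhost:9001/llm/cogvlm", json=payload)
print(response.json())
```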

First Impressions with Google’s Gemini

In this guide, we evaluate Google's Gemini LMM against several computer vision tasks, from OCR to VQA to zero-shot object detection.

Google's Gemini Multimodal Model: What We Know

In this guide, we discuss what Gemini is, for whom it is available, and what Gemini can do according to the information available from Google. We also look ahead to potential applications for Gemini in computer vision tasks.

Multimodal Maestro: Advanced LMM Prompting

Learn how to expand the capabilities of LMMs using Multimodal Maestro.

GPT-4 Vision Alternatives

Explore alternatives to GPT-4 Vision with Large Multimodal Models such as Qwen-VL and CogVLM, and fine-tuned detection models.

Distilling GPT-4 for Classification with an API

In this guide, learn how to distill GPT-4V to train an image classification model.
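
The core idea is to use GPT-4 with Vision as an automatic labeler and then train a smaller, cheaper classification model on its outputs. The snippet below sketches only the labeling step with the OpenAI Python client; the model name, prompt wording, and file names are assumptions, and the guide itself may use a different pipeline.

```python
# Sketch of the auto-labeling step in a GPT-4V distillation workflow.
# The model name, prompt wording, and file names are assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_image(path: str, classes: list[str]) -> str:
    """Ask GPT-4 with Vision to pick one class label for an image."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any GPT-4 model with vision support
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Classify this image as one of: {', '.join(classes)}. "
                         "Reply with the label only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

print(label_image("scratched_part.jpg", ["defect", "no defect"]))
```

The collected labels can then be used to train a lightweight classification model that runs without per-image API calls.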

DINO-GPT4-V: Use GPT-4V in a Two-Stage Detection Model

In this guide, we introduce DINO-GPT4V, a model that uses Grounding DINO to detect general objects and GPT-4V to refine labels.

How CLIP and GPT-4V Compare for Classification

In this post, we analyze how CLIP and GPT-4V compare for classification.
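
For context on the CLIP side of the comparison, zero-shot classification with CLIP amounts to scoring an image against a set of text labels. Below is a minimal sketch using the Hugging Face transformers implementation; the checkpoint and candidate labels are illustrative choices.

```python
# Minimal zero-shot classification with CLIP via Hugging Face transformers.
# The checkpoint and candidate labels are illustrative choices.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarity as probabilities

for label, prob in zip(labels, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")
```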

Experiments with GPT-4V for Object Detection

See our experiments that explore GPT-4V's object detection capabilities.

First Impressions with LLaVA-1.5

In this guide, we share our first impressions testing LLaVA-1.5.

GPT-4 with Vision: Complete Guide and Evaluation

In this guide, we share findings experimenting with GPT-4 with Vision, released by OpenAI in September 2023.

How Good Is Bing (GPT-4) Multimodality?

In this blog post, we qualitatively analyze how well Bing's combined text and image input, powered by GPT-4, performs at object detection tasks.

Multimodal Models and Computer Vision: A Deep Dive

In this post, we discuss what multimodal models are, how they work, and their impact on solving computer vision problems.