Multimodal

Distilling GPT-4 for Classification with an API

In this guide, learn how to distill GPT-4V to train an image classification model.

DINO-GPT4-V: Use GPT-4V in a Two-Stage Detection Model

In this guide, we introduce DINO-GPT4V, a model that uses Grounding DINO to detect general objects and GPT-4V to refine labels.

How CLIP and GPT-4V Compare for Classification

In this post, we analyze how CLIP and GPT-4V compare for classification.

Experiments with GPT-4V for Object Detection

See our experiments that explore GPT-4V's object detection capabilities.

GPT-4 Vision Prompt Injection

In this article, we explore what prompt injection is and the techniques people have been using to perform prompt injection attacks on GPT-4.

First Impressions with LLaVA-1.5

In this guide, we share our first impressions testing LLaVA-1.5.

GPT-4 with Vision: Complete Guide and Evaluation

In this guide, we share findings experimenting with GPT-4 with Vision, released by OpenAI in September 2023.

Using Stable Diffusion and SAM to Modify Image Contents Zero Shot

Recent breakthroughs in large language models (LLMs) and foundation computer vision models have unlocked new interfaces and methods for editing images or videos. You may have heard of inpainting…

How to Build a Semantic Image Search Engine with Supabase and OpenAI CLIP

Historically, building a robust search engine for images was difficult. One could search by features such as file name and image metadata, and use any context around an image…

ChatGPT Code Interpreter for Computer Vision

In this article, we share the results of our experimentation with ChatGPT's code interpreter feature on various computer vision tasks.

How Good Is Bing (GPT-4) Multimodality?

In this blog post, we qualitatively analyze how well Bing’s combination of text and image input ability performs at object detection tasks.

Multimodal Models and Computer Vision: A Deep Dive

In this post, we discuss what multimodal models are, how they work, and their impact on solving computer vision problems.

Zero-Shot Image Annotation with Grounding DINO and SAM - A Notebook Tutorial

In this comprehensive tutorial, discover how to speed up your image annotation process using Grounding DINO and Segment Anything Model. Learn how to convert object detection datasets into instance segmentation datasets, and use these models to automatically annotate your images.

Speculating on How GPT-4 Changes Computer Vision

OpenAI released GPT-4, showcasing strong multimodal general AI capabilities in addition to impressive logical reasoning. Are general models going to obviate the need to label images and train models?

OpenAI's CLIP is the most important advancement in computer vision this year

CLIP is a gigantic leap forward, bringing many of the recent developments from the realm of natural language processing into the mainstream of computer vision: unsupervised learning, transformers, and multimodality.

Experimenting with CLIP and VQGAN to Create AI Generated Art

Earlier this year, OpenAI announced a powerful art-creation model called DALL-E. Their model hasn't yet been released, but it has captured the imagination of a generation of hackers…

How to Try CLIP: OpenAI's Zero-Shot Image Classifier

Earlier this week, OpenAI dropped a bomb on the computer vision world.