CogVLM is an open source Large Multimodal Model (LMM). You can use CogVLM to ask questions about text and images. For example, you can ask CogVLM to count the number of objects in an image, to describe a scene, or to read characters in an image.

In qualitative testing, CogVLM achieved stronger performance than LLaVA and BakLLaVA, and achieved similar performance to Qwen-VL and GPT-4 with Vision.

You can deploy CogVLM on your own hardware with Roboflow Inference. Inference is an open source computer vision inference server that enables you to run both foundation models like CogVLM and models that you have trained (e.g. YOLOv8 models).

In this guide, we are going to walk through how to deploy and use CogVLM on your own hardware. We will install Roboflow Inference, then create a Python script that makes requests to the local CogVLM model deployed with Inference.

You can use this guide to deploy CogVLM on any cloud platform, such as AWS, GCP, and Azure. For this guide, we deployed CogVLM on a GCP Compute Engine instance with an NVIDIA T4 GPU.

Without further ado, let’s get started!

CogVLM Capabilities and Use Cases

CogVLM is a multimodal model that works with text and images. You can ask questions in text and optionally provide images as context. CogVLM has a wide range of capabilities that span the following tasks:

  • Visual Question Answering (VQA): Answer questions about an image.
  • Document VQA: Answer questions about a document.
  • Zero-shot object detection: Identify the coordinates of an object in an image.
  • Document OCR: Retrieve the text in a document.
  • OCR: Read text from a real-world image that isn’t a document.

With this range of capabilities, there are many potential use cases for CogVLM across industries. For example, you could use CogVLM in manufacturing to check if there is a forklift positioned near a conveyor belt. You could also use CogVLM to read serial numbers on shipping containers.

We recommend testing CogVLM to evaluate the extent to which the model is able to help with your production use case. Performance will vary depending on your use case, the quality of your images, and the model size you use. In the next section, we will talk about the model sizes available.

CogVLM Model Sizes

CogVLM can be run with different degrees of quantization. Quantization is a method used to make large machine learning models smaller so they can be run with lower RAM requirements. The more quantized the model, the faster, but less accurate, the model will be.

You can run CogVLM through Roboflow Inference with three degrees of quantization:

  • No quantization: Run the full model. For this, you will need 80 GB of GPU RAM. You could run the model on an 80 GB NVIDIA A100.
  • 8-bit quantization: Run the model with less accuracy than no quantization. You will need 32 GB of GPU RAM. You could run this model on an A100 with sufficient virtual RAM.
  • 4-bit quantization: Run the model with less accuracy than 8-bit quantization. You will need 16 GB of GPU RAM. You could run this model on an NVIDIA T4.

The model size you should use will depend on the hardware available to you and the level of accuracy required for your application. For the most accurate results, use CogVLM without quantization. In this guide, we will use 4-bit quantization so that we can run the model on a T4.
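
To make the trade-off above concrete, here is a minimal sketch that picks the least-quantized option that fits a given amount of GPU RAM. The memory figures are the ones quoted in this guide; the function and dictionary names are hypothetical and are not part of Roboflow Inference.

```python
# GPU RAM (in GB) that this guide quotes for each quantization level.
# The level names are labels invented for this sketch.
QUANTIZATION_GPU_GB = {
    "none": 80,
    "8-bit": 32,
    "4-bit": 16,
}


def pick_quantization(available_gpu_gb):
    """Return the least-quantized level that fits, or None if nothing fits."""
    # Check the most memory-hungry level first so accuracy is preferred.
    levels = sorted(
        QUANTIZATION_GPU_GB.items(), key=lambda item: item[1], reverse=True
    )
    for level, required_gb in levels:
        if available_gpu_gb >= required_gb:
            return level
    return None


print(pick_quantization(16))  # an NVIDIA T4 has 16 GB of GPU RAM
```

With 16 GB available, the sketch selects 4-bit quantization, which matches the configuration used in the rest of this guide.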

Step #1: Install Roboflow Inference

To deploy CogVLM, we will use Roboflow Inference. Inference uses Docker to create isolated environments in which you can run your vision models. Models deployed using Inference have an HTTP interface through which you can make requests.

First, install Docker on your machine. If you do not already have Docker installed, follow the official Docker installation instructions for your operating system to set up Docker.

Next, install the Inference Python package and CLI. We can use these packages to set up an Inference Docker container. Run the following command to install the requisite packages:

pip install inference inference-cli

To start an Inference server, run:

inference server start

Note: This command detects whether your machine has a GPU and pulls the matching Docker image. If you do not have a GPU, this command will pull a CPU-specific Docker image, which does not have CogVLM built in.

The first time you run this command, a Docker image will be downloaded from Docker Hub. Once you have the image on your machine, a container will start running.

An Inference server will be available at http://localhost:9001.
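
Before moving on, it can help to confirm that something is answering on that address. The following is a minimal sketch that uses only the Python standard library; the helper names are hypothetical, and it only checks that an HTTP server responds at the base URL, not that CogVLM itself is loaded.

```python
import urllib.error
import urllib.request


def server_url(host="localhost", port=9001):
    """Build the base URL of the local Inference server."""
    return f"http://{host}:{port}"


def is_server_up(url, timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar errors.
        return False
```

Calling is_server_up(server_url()) shortly after starting the server should return True once the container has finished starting up.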

Step #2: Run CogVLM

All models you deploy with Inference have dedicated HTTP routes. For this guide, we will use the CogVLM route to make a request to a CogVLM model. You can run CogVLM offline once you have downloaded the model weights.

Create a new Python file and add the following code:

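import base64

import requests

PORT = 9001
API_KEY = ""
IMAGE_PATH = "forklift.png"


def encode_base64(image_path):
    # Read the raw image bytes and return them as a base64-encoded ASCII string.
    with open(image_path, "rb") as image:
        image_string = base64.b64encode(image.read())

    return image_string.decode("ascii")


prompt = "Read the text in this image."

# Request body: the image, your Roboflow API key, and the question to ask.
infer_payload = {
    "image": {
        "type": "base64",
        "value": encode_base64(IMAGE_PATH),
    },
    "api_key": API_KEY,
    "prompt": prompt,
}

# Send the request to the CogVLM route on the local Inference server.
results = requests.post(
    f"http://localhost:{PORT}/llm/cogvlm",
    json=infer_payload,
).json()

print(results)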


This code will make an HTTP request to the /llm/cogvlm route. This route accepts text and images, which are sent to CogVLM for processing, and returns a JSON object with the text response from the model.

Above, replace:

  1. The empty API_KEY value with your Roboflow API key. Learn how to retrieve your Roboflow API key.
  2. forklift.png with the path to the image that you want to use to make a request.
  3. prompt with the question you want to ask.
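
If you would rather not hard-code these values, the request body can be assembled by a small helper. This is a minimal sketch: build_payload is a hypothetical name, and reading the key from an environment variable called ROBOFLOW_API_KEY is an assumption made for illustration. The payload keys mirror the ones used in the script above.

```python
import base64
import os


def build_payload(image_bytes, prompt, api_key=None):
    """Assemble the JSON body for the /llm/cogvlm route."""
    if api_key is None:
        # Assumption for this sketch: the key is stored in an environment
        # variable named ROBOFLOW_API_KEY.
        api_key = os.environ.get("ROBOFLOW_API_KEY", "")
    return {
        "image": {
            "type": "base64",
            "value": base64.b64encode(image_bytes).decode("ascii"),
        },
        "api_key": api_key,
        "prompt": prompt,
    }
```

You can then read your image with open(path, "rb"), pass the bytes and your question to build_payload, and send the result as the JSON body of the POST request to the /llm/cogvlm route.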

Let’s run the code on the following image of a forklift and ask the question “Is there a forklift close to a conveyor belt?”:

Our code returns:

{'response': 'yes, there is a forklift close to a conveyor belt, and it appears to be transporting a stack of items onto it.', 'time': 12.89864671198302}

The model returned the correct answer. On the NVIDIA T4 GPU we are using, inference took ~12.9 seconds. Let’s ask the question “Is the worker wearing gloves?”. The model returns:

{'response': 'No, the forklift worker is not wearing gloves.', 'time': 10.490668170008576}

Our model returned the correct response. The model took ~10.5 seconds to calculate a response.

When we asked if the worker in the picture above was wearing a hard hat, the model said "Yes, the worker is wearing a safety hard hat." This was not correct.

As with all multimodal language models, model performance will vary depending on the image you provide, your prompt, and the degree of quantization you apply to your model. We quantized this model so that it would run on a T4, but the model will be more accurate without this quantization.
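
If you plan to act on yes/no answers like the ones above in an application, you will likely want to turn the free-text response into a value your code can branch on. Here is a minimal sketch, assuming the response dictionary has the 'response' and 'time' keys shown in the outputs above; parse_yes_no is a hypothetical helper, not part of Inference.

```python
def parse_yes_no(result):
    """Map a CogVLM result dict to (answer, seconds).

    answer is True for "yes", False for "no", and None for anything else.
    """
    words = result.get("response", "").lower().split()
    # Look only at the first word, ignoring trailing punctuation.
    first_word = words[0].strip(".,!?:;") if words else ""
    if first_word == "yes":
        answer = True
    elif first_word == "no":
        answer = False
    else:
        answer = None
    return answer, round(result.get("time", 0.0), 1)
```

A parsed "yes" only reflects what the model said. As the hard hat example shows, the answer itself can still be wrong, so treat the result as a signal to verify rather than as ground truth.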


CogVLM is a Large Multimodal Model (LMM). You can ask CogVLM questions about images and retrieve responses. CogVLM can perform a range of computer vision tasks, from identifying the presence of an object in an image to reading characters in an image to zero-shot object detection.

You can use CogVLM with different levels of quantization to run the model with less RAM. In this guide, we used CogVLM with 4-bit quantization so we could run the model on an NVIDIA T4. We asked questions about an image of a forklift in a warehouse: the model answered two correctly and one, about the hard hat, incorrectly.