How to Deploy CogVLM on AWS

Published Dec 20, 2023 • 3 min read

Due to dependency conflicts with newer models, and because security vulnerabilities discovered in the transformers library were patched only in versions incompatible with the model, we announced End of Life for CogVLM support in inference, effective as of release 0.38.0.

We are leaving this page only for reference.

We encourage you to try the Visual Language Models that inference fully supports, including Qwen2.5-VL.

CogVLM, a powerful open-source Large Multimodal Model (LMM), offers robust capabilities for tasks like Visual Question Answering (VQA), Optical Character Recognition (OCR), and Zero-shot Object Detection.

In this guide, I'll walk you through deploying a CogVLM Inference Server with 4-bit quantization on Amazon Web Services (AWS). Let's get started.

Set Up EC2 Instance

This section is crucial even for those experienced with EC2. It will help you understand the hardware and software requirements for a CogVLM Inference Server.

To start, search for EC2 in the AWS console. Then, under 'Instances', click the 'Launch Instances' button and fill out the form according to the specifications below.

GPU Memory: The 4-bit quantized CogVLM model requires 11 GB of GPU memory. Opt for an NVIDIA T4 GPU, typically available in AWS g4dn instances. You might need to request an increase in your AWS quota to access these instances.
CUDA and Software Requirements: Ensure your machine has CUDA 11.7 or newer and Docker with NVIDIA GPU support. Choosing an OS image like 'Deep Learning AMI GPU PyTorch' simplifies the process.
Network: For this setup, allow incoming SSH traffic (for remote shell access) and incoming HTTP traffic (for web connections).
Keys: Create and securely store an SSH key for accessing your machine.
Storage: Allocate around 50 GB for the Docker image and CogVLM model weights, with a little extra space as a buffer.
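
If you prefer the command line to the console form, the specifications above can be sketched as an AWS CLI call. This is only a rough illustration: the AMI ID, key pair name, security group ID, root device name, and the `g4dn.xlarge` size are all assumptions to replace with your own values. The script prints the command for review rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: assemble an `aws ec2 run-instances` call matching the specs above.
# Every ID and name below is a placeholder (assumption); substitute your own.
AMI_ID="${AMI_ID:-ami-REPLACE_ME}"                 # a 'Deep Learning AMI GPU PyTorch' image in your region
KEY_NAME="${KEY_NAME:-cogvlm-key}"                 # hypothetical SSH key pair name
SECURITY_GROUP="${SECURITY_GROUP:-sg-REPLACE_ME}"  # group that allows inbound SSH and HTTP
INSTANCE_TYPE="${INSTANCE_TYPE:-g4dn.xlarge}"      # assumed g4dn size with a single NVIDIA T4
ROOT_DEVICE="${ROOT_DEVICE:-/dev/sda1}"            # assumed root device name; check your AMI
VOLUME_GB="${VOLUME_GB:-50}"                       # room for the Docker image and model weights

# Print the command instead of running it, so it can be reviewed first.
build_launch_cmd() {
  echo "aws ec2 run-instances" \
    "--image-id $AMI_ID" \
    "--instance-type $INSTANCE_TYPE" \
    "--key-name $KEY_NAME" \
    "--security-group-ids $SECURITY_GROUP" \
    "--block-device-mappings DeviceName=$ROOT_DEVICE,Ebs={VolumeSize=$VOLUME_GB}" \
    "--count 1"
}

build_launch_cmd
```

Once the placeholders are filled in, the printed line can be pasted into a shell that has AWS credentials configured.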

Set Up Inference Server

Once logged in via SSH using your locally saved key, proceed with the following steps:

Check the CUDA version and GPU accessibility, and verify the Docker and Python installations.

# verify GPU accessibility and CUDA version 
nvidia-smi 

# verify Docker installation 
docker --version
nvidia-docker --version

# verify Python installation 
python --version
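
The manual checks above could also be turned into pass/fail tests. The following is a minimal sketch, assuming GNU `sort -V` is available and that `nvidia-smi` prints its usual `CUDA Version:` banner; the helper names `version_ge` and `parse_cuda_version` are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: compare what nvidia-smi reports against the requirements in this guide.
MIN_CUDA="11.7"      # minimum CUDA version stated in this guide
MIN_GPU_MIB=11264    # 11 GB expressed in MiB, the unit nvidia-smi reports

# version_ge A B: succeeds when dotted version A >= B (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Reads nvidia-smi output on stdin and prints the CUDA version from its banner.
parse_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

if command -v nvidia-smi >/dev/null 2>&1; then
  cuda="$(nvidia-smi | parse_cuda_version)"
  mem="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)"
  if version_ge "$cuda" "$MIN_CUDA"; then
    echo "CUDA $cuda: ok"
  else
    echo "CUDA $cuda: older than $MIN_CUDA"
  fi
  if [ "$mem" -ge "$MIN_GPU_MIB" ] 2>/dev/null; then
    echo "GPU memory $mem MiB: ok"
  else
    echo "GPU memory $mem MiB: below $MIN_GPU_MIB MiB"
  fi
else
  echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```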

Install Python packages and start the Inference Server.

# install required python packages
pip install inference==0.9.7rc2 inference-cli==0.9.7rc2 

# start inference server
inference server start

This step involves downloading a large Docker image (11 GB) to run CogVLM, which might take a few minutes.

Run `docker ps` to make sure the server is running. You should see a `roboflow/roboflow-inference-server-gpu:latest` container running in the background.
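
If you would rather script that check than read the `docker ps` table by eye, something like the following may work. It is a sketch only: `server_listed` is a hypothetical helper, and it simply looks for the image name from this guide among the images of running containers.

```shell
#!/usr/bin/env bash
# Sketch: confirm that the Inference Server container is up.
IMAGE="roboflow/roboflow-inference-server-gpu"

# Reads image names on stdin (one per line); succeeds if the server image is among them.
server_listed() {
  grep -q "^${IMAGE}"
}

if command -v docker >/dev/null 2>&1; then
  # `--format '{{.Image}}'` prints only the image column of `docker ps`.
  if docker ps --format '{{.Image}}' | server_listed; then
    echo "Inference Server container is running"
  else
    echo "Inference Server container not found; try 'inference server start' again"
  fi
fi
```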

Run Inference

To test the CogVLM inference, use a client script available on GitHub:

Clone the repository and set up the environment.

# clone cog-vlm-client repository
git clone https://github.com/roboflow/cog-vlm-client.git
cd cog-vlm-client

# setup python environment and activate it [optional]
python3 -m venv venv
source venv/bin/activate

# install required dependencies
pip install -r requirements.txt

# download example data [optional]
./setup.sh

Acquire your Roboflow API key and export it as an environment variable to authenticate to the Inference Server.

export ROBOFLOW_API_KEY="<YOUR_ROBOFLOW_API_KEY>"
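
A small guard can catch a missing key before the app starts. This is an optional sketch, not part of the client repository; `require_api_key` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: fail early when the API key has not been exported.

# Succeeds only when ROBOFLOW_API_KEY is set and non-empty.
require_api_key() {
  if [ -z "${ROBOFLOW_API_KEY:-}" ]; then
    echo "ROBOFLOW_API_KEY is not set; export it before running app.py" >&2
    return 1
  fi
}

if require_api_key; then
  echo "API key found"
fi
```

In practice you could chain it with the launch step, for example `require_api_key && python app.py`.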

Run the Gradio app and query images.

python app.py

The Gradio app will generate a unique link that you can use to query your CogVLM model from any computer or phone.

Note: The first request to the server might take several minutes because the server loads the model weights into GPU memory. Monitor this process using `docker system df` and `nvidia-smi`. Subsequent requests should take no more than about a dozen seconds.

`docker system df` output after loading the Inference Server image and CogVLM weights

`nvidia-smi` output after loading CogVLM weights into memory
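
To watch the weights load during that first request, GPU memory can be polled in a loop. The sketch below assumes the `memory.used` and `memory.total` query fields of `nvidia-smi`; the helper name `summarize_gpu_mem`, the poll count, and the interval are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll GPU memory while the first request loads the CogVLM weights.
POLLS="${POLLS:-12}"        # number of samples to take (arbitrary default)
INTERVAL="${INTERVAL:-10}"  # seconds between samples (arbitrary default)

# Reads "used, total" MiB pairs on stdin and prints a one-line summary per GPU.
summarize_gpu_mem() {
  awk -F', *' '$2 > 0 { printf "%d MiB of %d MiB used (%d%%)\n", $1, $2, $1 * 100 / $2 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  i=0
  while [ "$i" -lt "$POLLS" ]; do
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | summarize_gpu_mem
    sleep "$INTERVAL"
    i=$((i + 1))
  done
fi
```

Memory use should climb while the weights load and then level off; `docker system df` can be run alongside it to check disk usage.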

Conclusions

CogVLM is a versatile and powerful LMM, adept at handling a range of computer vision tasks. In many cases, it can successfully replace GPT-4V and give you more control. Visit the Inference documentation to learn how to deploy CogVLM as well as other computer vision models.

Cite this Post

Use the following entry to cite this post in your research:

Piotr Skalski. (Dec 20, 2023). How to Deploy CogVLM on AWS. Roboflow Blog: https://blog.roboflow.com/how-to-deploy-cogvlm-in-aws/

Written by

Piotr Skalski

ML Growth Engineer @ Roboflow | Owner @ github.com/SkalskiP/make-sense (2.4k stars) | Blogger @ skalskip.medium.com/ (4.5k followers)
