What is a Foundation Model? An Introduction.

A foundation model is an artificial intelligence (AI) model that is trained on a broad and diverse range of data to perform a wide variety of tasks.

Foundation models can be used without additional training and can be fine-tuned for a variety of downstream tasks, including auto-labeling images for training computer vision models. These models learn patterns, relationships, and representations from vast amounts of data, often through self-supervised or unsupervised learning techniques.

The term "foundation" refers to the idea that these models provide a foundation or a starting point for a wide range of AI applications.

Foundation models are large, pre-trained models that have been trained on vast amounts of data such as text, images, audio, and video. These models are designed to learn general features and patterns in the data, which can be applied to a variety of tasks.

What is a Foundation Model? (Source)

The key characteristics of foundation models are:

  • Large-scale pre-training: Foundation models are trained on massive datasets, often using distributed computing and large-scale optimization techniques.
  • Generalizability: Foundation models can be fine-tuned for a wide range of tasks, often with minimal additional training data (see the fine-tuning sketch after this list).
  • Transferability: Foundation models can be applied to new, unseen tasks, and can adapt to new domains and datasets.
  • Flexibility: Foundation models can be used as a starting point for a variety of applications, from language translation to image generation.
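
To make fine-tuning concrete, here is a minimal sketch of adapting a pre-trained foundation model to a downstream text-classification task. It assumes the Hugging Face transformers and datasets libraries (not part of the pip install list later in this post); the bert-base-uncased checkpoint and the IMDB dataset are illustrative choices only.

# Minimal sketch: fine-tuning a pre-trained model on a small downstream dataset.
# Assumes the Hugging Face `transformers` and `datasets` libraries are installed;
# the checkpoint and dataset names are illustrative, not part of this blog's workflows.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "bert-base-uncased"  # pre-trained foundation model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Small labeled dataset for the downstream task (sentiment classification)
dataset = load_dataset("imdb", split="train").shuffle(seed=0).select(range(1000))
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()  # only the small task-specific dataset is needed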

Types of Foundation Models

Foundation models can be categorized into three main types:

  • Large language models (LLMs)
  • Vision language models (VLMs)
  • Multimodal foundation models
Mind Map of different Foundation Models (Source)

Large Language Models (LLMs)

These models are trained on massive text datasets and specialize in understanding, generating, and processing human language. They can perform tasks like translation, text summarization, sentiment analysis, and question answering. An example of a language foundation model is GPT-3 (Generative Pre-trained Transformer 3). Developed by OpenAI, GPT-3 is a powerful language model trained on diverse internet text. It can generate human-like text, answer questions, and assist in writing code. Applications of GPT-3 include chatbots, content creation, automated customer support, and code generation.
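
To give a sense of how an LLM is used in practice, the snippet below sends a prompt to a hosted chat model through the OpenAI Python SDK (not part of the pip install list later in this post). GPT-3 itself has been superseded in the API, so the model name and prompt here are illustrative assumptions.

# Minimal sketch: calling a hosted LLM with the OpenAI Python SDK.
# The model name and prompt are illustrative; set OPENAI_API_KEY in your environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "user",
         "content": "Summarize the idea of a foundation model in two sentences."}
    ],
)
print(response.choices[0].message.content)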

Vision Language Models (VLMs)

These models are designed to understand and process images and videos. They can recognize objects, classify images, detect anomalies, and even generate visual content. An example of a vision foundation model is ViT (Vision Transformer), developed by Google. ViT is a transformer-based model for image recognition. Unlike traditional CNNs, ViT treats an image as a sequence of patches and processes it using self-attention mechanisms. Applications of ViT include object detection, medical image analysis, facial recognition, and autonomous driving.

ViT Architecture (Source)
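
The snippet below is a minimal sketch of running a pre-trained ViT for image classification with the Hugging Face transformers library; the checkpoint name and image path are illustrative assumptions rather than part of the workflows used later in this post.

# Minimal sketch: image classification with a pre-trained Vision Transformer.
# Assumes the Hugging Face `transformers` library and PyTorch; checkpoint and image are illustrative.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("dogs.jpg").convert("RGB")   # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits             # one logit per ImageNet class

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])   # human-readable label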

Multimodal Foundation Models

These models can process and generate information across multiple data modalities, such as text, images, audio, and video.

An example of a multimodal foundation model is CLIP (Contrastive Language-Image Pretraining), developed by OpenAI, which learns to associate text and images through contrastive learning. It is capable of understanding and generating insights from both text and visual data, making it a powerful example of a multimodal AI system.

CLIP Architecture (Source)
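
As a minimal sketch of CLIP's contrastive text-image matching, the snippet below performs zero-shot image classification with the Hugging Face transformers implementation; the checkpoint, image path, and candidate labels are illustrative assumptions.

# Minimal sketch: zero-shot image classification with CLIP.
# Assumes the Hugging Face `transformers` library; checkpoint, image, and labels are illustrative.
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dogs.jpg").convert("RGB")
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, converted to probabilities over the candidate labels
probs = outputs.logits_per_image.softmax(dim=1)
for label, prob in zip(labels, probs[0].tolist()):
    print(f"{label}: {prob:.2f}")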

Example of Using Foundation Models

Now we will look at some examples of using foundation models with vision capabilities to explore some of the tasks they can perform. We will use the OpenAI GPT-4o, Google Gemini, and Llama 3.2 Vision models for our examples. These models are foundation models because they:

  • Are pretrained on massive multimodal datasets including images
  • Can process and understand visual information
  • Can perform various vision-related downstream tasks
  • Use self-supervised learning techniques during pretraining
  • Can be fine-tuned to perform various domain-specific downstream tasks
💡
To use the code provided in this blog, you need to install the following libraries.
pip install roboflow inference inference-sdk gradio

Google Gemini

Google Gemini is a family of advanced, multimodal large language models (LLMs) developed by Google. It processes and generates information across data types such as text, images, audio, video, and code, and can be used to develop versatile applications.

Gemini Model Overview (Source)

Key Features of Gemini

Multimodal Integration

Gemini can process and combine text, images, audio, video, and code in both inputs and outputs. It excels at cross-modal tasks such as generating image captions and answering questions about videos.

Scalability

Gemini is available in multiple sizes (e.g., Gemini Nano for on-device use, Gemini Pro for general purposes, Gemini Ultra for advanced tasks). Gemini is optimized for efficiency across platforms, from mobile devices to cloud infrastructure.

Advanced Reasoning

Gemini can perform complex problem-solving in mathematics, coding, and scientific domains. It supports logical inference and contextual understanding.

Multilingual Support

Gemini is trained on diverse datasets, enabling fluency in over 100 languages for translation, content generation, and analysis.

Real-Time Capabilities

Gemini can process streaming data (e.g., live video or audio) for applications like real-time translation or interactive assistants.

Integration with Google Ecosystem

Gemini is also built into various Google products such as Workspace, Search, and Android, and is accessible via APIs (Vertex AI, Google AI Studio).
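
While the examples below call Gemini through Roboflow Workflows, the model can also be called directly. The snippet below is a minimal sketch using the google-generativeai Python SDK (not included in the pip install list above); the API key placeholder and image path are assumptions for illustration.

# Minimal sketch: calling Gemini directly through the google-generativeai SDK.
# The API key placeholder and image path are illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="GOOGLE_API_KEY")  # Add your Google AI Studio API key here
model = genai.GenerativeModel("gemini-1.5-flash")

image = Image.open("dogs.jpg")
response = model.generate_content(["Describe the dogs in this image.", image])
print(response.text)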

Using Gemini for Object Detection

In this example, we will detect the presence of dogs in an image, identify their breed, and retrieve the bounding box coordinates. Using this information, we will draw a bounding box around the detected dog and display its breed as a label.

Create a Roboflow Workflow with the following configuration for the Gemini block.

Gemini (gemini-1.5-flash) Workflow

The following input image is used for this example.

Input image 'dogs.jpg'

Now, create a new Python file and add the following code:

import matplotlib.pyplot as plt
import re
import json
from PIL import Image, ImageDraw, ImageFont
from inference_sdk import InferenceHTTPClient

# Specify the image filename
image_path = "dogs.jpg" 

client = InferenceHTTPClient(
    api_url="https://detect.roboflow.com",
    api_key="ROBOFLOW_API_KEY"
)

result = client.run_workflow(
    workspace_name="tim-4ijf0",
    workflow_id="custom-workflow-6",
    images={
        "image": image_path
    },
    parameters={
        "prompt": "find all dogs in image and return JSON {'breed':'', 'confidence':'', 'ymin':'', 'xmin':'', 'ymax':'', 'xmax':''} "
    },
    use_cache=True # cache workflow definition for 15 minutes
)

# Function to extract bounding boxes from result
def extract_bboxes(result):
    bbox_list = []
    for item in result:
        output_text = item.get('google_gemini', {}).get('output', '')
        match = re.search(r'```json\n([\s\S]+?)\n```', output_text)  # Extract JSON part
        if match:
            json_text = match.group(1).strip()
            try:
                bbox_list = json.loads(json_text)  # Convert JSON string to Python list
            except json.JSONDecodeError:
                print("Error decoding JSON from result variable")
    return bbox_list

# Extract bounding boxes
bounding_boxes = extract_bboxes(result)

# Load the image using PIL
image = Image.open(image_path)
width, height = image.size
draw = ImageDraw.Draw(image)

# Define a list of colors for different breeds
colors = ["red", "blue", "green", "yellow", "cyan", "magenta"]
breed_color_map = {}  # Dictionary to store colors for each breed

# Font for labels
try:
    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", size=18)  
except IOError:
    font = ImageFont.load_default()  # Use default font

# Draw bounding boxes
for i, obj in enumerate(bounding_boxes):
    breed = obj['breed']
    confidence = float(obj['confidence']) * 100  # Convert confidence to percentage
    ymin, xmin, ymax, xmax = obj['ymin'], obj['xmin'], obj['ymax'], obj['xmax']

    # Assign a unique color per breed
    if breed not in breed_color_map:
        breed_color_map[breed] = colors[len(breed_color_map) % len(colors)]

    color = breed_color_map[breed]

    # Convert normalized coordinates to absolute coordinates
    abs_x1 = int(xmin / 1000 * width)
    abs_y1 = int(ymin / 1000 * height)
    abs_x2 = int(xmax / 1000 * width)
    abs_y2 = int(ymax / 1000 * height)

    # Draw the bounding box
    draw.rectangle(((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=4)

    # Draw the label (breed and confidence)
    label = f"{breed} ({confidence:.1f}%)"
    text_size = draw.textbbox((0, 0), label, font=font)
    text_width, text_height = text_size[2] - text_size[0], text_size[3] - text_size[1]

    # Draw label background
    draw.rectangle(
        ((abs_x1, abs_y1 - text_height), (abs_x1 + text_width, abs_y1)),
        fill=color
    )

    # Draw label text
    draw.text((abs_x1, abs_y1 - text_height), label, fill="white", font=font)

# Display image with bounding box
plt.figure(figsize=(10, 6))
plt.imshow(image)
plt.axis("off")
plt.show()

The code leverages a Roboflow Workflow to send an input image and a custom prompt to a Gemini model. The model returns a JSON response that details each detected dog's breed, confidence score, and bounding box coordinates. This data is then processed to draw bounding boxes, each in a unique color corresponding to the detected breed, and labels onto the original image. Because Gemini returns bounding box coordinates normalized to a 0-1000 range, they are scaled by the image width and height so the boxes are drawn accurately on the image. Overall, the Roboflow Workflow streamlines the process of using foundation models. You will see the following output when you run the code:

Output

Gemini detects where the dogs are in the image and identifies their breeds.

Using Gemini for Video Object Detection

In this example, we will use Gemini’s video understanding capabilities to detect objects in a video. The model will search for an object, identify it, and return its name along with its bounding box coordinates. This information is then used to generate an output video that displays the detected object with a label and bounding box.

💡
The Gemini Free Tier API allows processing only short videos due to its limited requests-per-minute (RPM) quota. For longer videos, consider using Tier 1 or Tier 2.

Use the same Workflow as above. We will provide the following prompt to Gemini:

"prompt": "Is there blue object? Return JSON for it {'ymin':'', 'xmin':'', 'ymax':'', 'xmax':'', 'objname': ''}"

I used the following video for this example, trimmed to a shorter length.

Input Video 'robot.mp4' (Source)

Create a new Python file and add this code:

import cv2
import json
import re
from inference import InferencePipeline
from inference.core.interfaces.camera.entities import VideoFrame

# Initialize video writer
input_video = "robot.mp4"
output_video = "output_with_bboxes.mp4"
cap = cv2.VideoCapture(input_video)
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))
fps = int(cap.get(cv2.CAP_PROP_FPS))
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(output_video, fourcc, fps, (frame_width, frame_height))

def extract_bboxes(result):
    """Extract bounding boxes from the inference result."""
    bbox_list = []
    output_text = result.get('google_gemini', {}).get('output', '')
    match = re.search(r'```json\n([\s\S]+?)\n```', output_text)
    if match:
        json_text = match.group(1).strip()
        try:
            parsed = json.loads(json_text)  # Parse the JSON once
            # Normalize to a list so single and multiple detections are handled the same way
            bbox_list = [parsed] if isinstance(parsed, dict) else parsed
        except json.JSONDecodeError:
            print("Error decoding JSON from result variable")
    return bbox_list

def draw_bboxes(frame, bounding_boxes):
    """Draw bounding boxes on the frame."""
    for obj in bounding_boxes:
        objname = obj['objname']
        ymin, xmin, ymax, xmax = obj['ymin'], obj['xmin'], obj['ymax'], obj['xmax']
        abs_x1 = int((xmin / 1000) * frame.shape[1])
        abs_y1 = int((ymin / 1000) * frame.shape[0])
        abs_x2 = int((xmax / 1000) * frame.shape[1])
        abs_y2 = int((ymax / 1000) * frame.shape[0])
        label = f"{objname}"
        
        cv2.rectangle(frame, (abs_x1, abs_y1), (abs_x2, abs_y2), (0, 255, 0), 2)
        print(f"Bounding Box - xmin: {abs_x1}, ymin: {abs_y1}, xmax: {abs_x2}, ymax: {abs_y2}")
        cv2.putText(frame, label, (abs_x1, abs_y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    return frame

def process_frame(result, video_frame: VideoFrame):
    """Process each frame and draw bounding boxes."""
    frame = video_frame.image
    bounding_boxes = extract_bboxes(result)
    frame_with_boxes = draw_bboxes(frame, bounding_boxes)
    
    # Write frame to output video
    out.write(frame_with_boxes)
    cv2.imshow("Output", frame_with_boxes)
    cv2.waitKey(1)
    
    # Print progress
    print("Processed frame with bounding boxes.")

# Initialize and start the pipeline
pipeline = InferencePipeline.init_with_workflow(
    api_key="ROBOFLOW_API_KEY",
    workspace_name="tim-4ijf0",
    workflow_id="custom-workflow-6",
    video_reference=input_video,
    max_fps=30,
    on_prediction=process_frame,
    workflows_parameters={"prompt": "Is there blue object? Return JSON for it {'ymin':'', 'xmin':'', 'ymax':'', 'xmax':'', 'objname': ''}"}
)

pipeline.start()
pipeline.join()

cap.release()
out.release()
cv2.destroyAllWindows()

This code processes a video to detect a specified object throughout its frames using a Roboflow Workflow integrated with the Gemini model. The input video and a prompt defining the object to search for are passed to the workflow. The Gemini model analyzes each frame and returns a JSON response containing the object's name and its bounding box coordinates whenever it is detected. These bounding boxes, along with the object's name, are then drawn on the frames, and the processed frames are saved to an output video. The bounding box coordinates are normalized and converted to absolute pixel values to ensure accurate placement in each frame. The pipeline runs frame-by-frame inference, displaying the processed frames in real time while also writing them to the final output video. This automated process enables efficient object detection across an entire video. When you run the code, you will see the following output:

Output

This example is highly useful for developing a video search application that can detect objects within a video.

GPT-4o

GPT‑4o (the "o" stands for "omni") is OpenAI’s next-generation, multimodal large language model released in May 2024. It builds on the capabilities of GPT‑4 by integrating text, image, and audio inputs and outputs into a single unified model. GPT-4o is an autoregressive omni-model capable of processing any combination of text, audio, image, and video inputs and generating any combination of text, audio, and image outputs. It is trained end-to-end across multiple modalities, meaning that the same neural network handles all input and output processing for text, vision, and audio seamlessly.
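
Before moving to the Roboflow Workflow example, here is a minimal sketch of sending an image directly to GPT-4o through the OpenAI Python SDK (not included in the pip install list above); the image path and prompt are illustrative assumptions.

# Minimal sketch: sending an image to GPT-4o with the OpenAI Python SDK.
# The image path and prompt are illustrative; set OPENAI_API_KEY in your environment.
import base64
from openai import OpenAI

client = OpenAI()

with open("cup.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Do you see a black cup? Yes or No."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)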

Key Features of GPT-4o

Multimodal Input/Output

GPT-4o processes text, images, audio, and video in a single model architecture. It can generate output in different modalities (e.g., answering a text question with an image or voice response).

Variants

GPT‑4o is the full‑scale multimodal model optimized for high performance across text, image, and audio tasks, featuring a very large context window (up to 128,000 tokens) and extensive capabilities.

GPT‑4o‑mini is a smaller, more cost‑efficient variant designed to run with lower latency and reduced computational resources while still delivering many of the core multimodal features.

Real-Time Interaction

It supports low-latency, real-time conversational applications (e.g., voice assistants with instant responses).

Improved Reasoning

GPT-4o has enhanced problem-solving in math, coding, and logic compared to GPT-4. It is better at contextual understanding and long-form content generation.

Multilingual Proficiency

GPT-4o works across 50+ languages for translation, content creation, and analysis.

Safety and Alignment

GPT-4o has built-in safeguards to reduce harmful outputs and bias. It follows ethical guidelines more robustly than earlier models.

GPT-4o Example: Object Detection and Alert System

In this example, we will detect a specific object, a black cup, in a live camera feed and trigger an alert if the object is not found. This approach can be used to develop surveillance applications or fire alert systems, as GPT-4o excels at understanding scenes in image or video and accurately describing real-time events.

Create the Roboflow Workflow for this example with the configuration for the OpenAI block shown in the following image.

OpenAI (GPT-4o) Workflow

We will provide the following prompt to the GPT-4o model:

"prompt": "Do you see black cup? Yes or No."

Create a new Python file and add the following code:

import cv2
from inference import InferencePipeline
from inference.core.interfaces.camera.entities import VideoFrame

# Define a custom sink function to process and display predictions
def my_sink(predictions: dict, video_frame: VideoFrame):
    # Extract the 'output' text from the 'open_ai' key in predictions
    response_text = predictions.get('open_ai', {}).get('output', '')

    # Determine the alert message based on the presence of "Yes" in the response
    if 'Yes' in response_text:
        alert_message = "Black Cup Detected"
        color = (0, 255, 0)  # Green for detection
    else:
        alert_message = "Black Cup Not Detected"
        color = (0, 0, 255)  # Red for no detection

    # Define text properties
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 1.0
    thickness = 2
    position = (50, 50)  # Position to display the alert message

    # Overlay the alert message on the video frame
    cv2.putText(video_frame.image, alert_message, position, font, font_scale, color, thickness, cv2.LINE_AA)

    # Display the video frame with the overlaid alert message
    cv2.imshow("Live Webcam Feed with Alerts", video_frame.image)

    # Exit the display window when 'q' key is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        pipeline.stop()
        cv2.destroyAllWindows()

    
# initialize a pipeline object
pipeline = InferencePipeline.init_with_workflow(
    api_key="ROBOFLOW_API_KEY",
    workspace_name="tim-4ijf0",
    workflow_id="custom-workflow-7",
    video_reference=0, # Path to video, device id (int, usually 0 for built in webcams), or RTSP stream url
    max_fps=30,
    on_prediction=my_sink,
    workflows_parameters={
        "prompt": "Do you see black cup? Yes or No."
    }
)
pipeline.start() #start the pipeline
pipeline.join() #wait for the pipeline thread to finish

This code uses the GPT-4o model through a Roboflow Workflow to interpret a live video stream from a webcam and display real-time results. The workflow is initialized with a prompt asking whether a black cup is present in the video feed. The model continuously analyzes the incoming frames and returns a response indicating the presence or absence of the object. Based on the model's output, an alert message, either "Black Cup Detected" (displayed in green) or "Black Cup Not Detected" (displayed in red), is overlaid on the video feed. The processed frames are displayed in a live window, updating dynamically. This setup enables real-time scene understanding and decision-making with GPT-4o, making it useful for applications like automated monitoring, surveillance, and visual inspection.

You will see an output similar to the following. In the image, the GPT-4o model detects the cup (on the left) when it is within the video frame but does not detect it when a different object is present.

Output

Llama 3.2 Vision

Llama 3.2 Vision is an open-source large language model (LLM) developed by Meta as part of the latest update to its Llama series. It extends the capabilities of the text-only Llama 3.1 model, an auto-regressive language model based on an optimized transformer architecture designed for high-performance text processing. Llama 3.2 Vision builds on this foundation by integrating vision capabilities, bringing multimodal (text and image) understanding to the Llama family so it can process both text and images and generate text outputs seamlessly.

Llama 3.2-Vision architecture (Source)
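
The Workflow example below runs Llama 3.2 Vision through Roboflow, but the model can also be run locally. The following is a minimal sketch assuming an Ollama installation with the llama3.2-vision model pulled and the ollama Python package installed; the image path and prompt are illustrative.

# Minimal sketch: querying Llama 3.2 Vision locally through Ollama.
# Assumes `ollama pull llama3.2-vision` has been run and the `ollama` Python package is installed.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "What text do you see in this image?",
        "images": ["receipt.jpg"],  # illustrative local image path
    }],
)
print(response["message"]["content"])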

Key Features of Llama 3.2 Vision

Multimodal Capabilities

Llama 3.2 Vision can take both text and images as input and generate text-based outputs. This allows it to understand and reason about visual content, such as describing images, answering questions about them, or interpreting complex visual data like charts and graphs.

Model Sizes

Available in two sizes:

  • 11B: A smaller, more efficient model suitable for tasks requiring moderate computational resources.
  • 90B: A larger, more powerful model excelling in sophisticated reasoning and detailed image understanding, ideal for enterprise-level applications.

High Performance

Llama 3.2 Vision is highly accurate in image recognition and visual understanding tasks. It performs well on industry benchmarks for visual recognition, captioning, and reasoning.

Multilanguage Capabilities

Llama 3.2 Vision officially supports languages such as English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai for text-only tasks.

Llama 3.2 Vision Example: Image Language Translator Application

In this example, we will create a GUI application using Gradio to detect text in an image with the Llama 3.2 Vision model. The application will identify the language and content of the extracted text, allowing users to translate it into their desired language. For this example, I am using the following languages:

["English", "Spanish", "French", "German", "Chinese", "Hindi"]

This application is useful in various scenarios, including:

  • Helps users extract and translate text from images, making it valuable for travelers.
  • Useful for scanning printed materials, receipts, or handwritten notes and converting them into editable and translatable text.
  • Helps in extracting product details from packaging or labels and translating them for international markets.
  • Assists visually impaired users by detecting and reading out text from images in their preferred language.

To build the application, create a Workflow with the following configuration for the Llama 3.2 Vision block:

Llama 3.2 Vision Workflow

We will provide the following prompt to the Llama 3.2 Vision model:

prompt = f"detect language and translate it to {target_language} in JSON format {{ 'detected_language' : '', 'detected_text':'', 'translated_language' : '', 'translated_text':''}}"

Create a new Python file and add the following code:

import gradio as gr
from inference_sdk import InferenceHTTPClient
import json

# Initialize Roboflow client
client = InferenceHTTPClient(
    api_url="https://detect.roboflow.com",
    api_key="ROBOFLOW-API_KEY" # Add your Roboflow API key here
)

def translate_image(image, target_language):
    prompt = f"detect language and translate it to {target_language} in JSON format {{ 'detected_language' : '', 'detected_text':'', 'translated_language' : '', 'translated_text':''}}"

    result = client.run_workflow(
        workspace_name="tim-4ijf0",
        workflow_id="custom-workflow-8",
        images={"image": image},
        parameters={"prompt": prompt},
        use_cache=True
    )

    # Extract the model's output and parse it as JSON
    # (the model may return single-quoted keys, so convert them to double quotes for valid JSON)
    raw_output = result[0]['llama_vision']['output']
    translation_data = json.loads(raw_output.replace("'", '"'))

    detected_language = translation_data.get("detected_language", "")
    detected_text = translation_data.get("detected_text", "")
    translated_text = translation_data.get("translated_text", "")

    return detected_language, detected_text, translated_text

languages = ["English", "Spanish", "French", "German", "Chinese", "Hindi"]

# Gradio Interface
iface = gr.Interface(
    fn=translate_image,
    inputs=[
        gr.Image(sources=["webcam", "upload"], type="filepath"),
        gr.Dropdown(choices=languages, label="Target Translation Language")
    ],
    outputs=[
        gr.Textbox(label="Detected Language"),
        gr.Textbox(label="Detected Text"),
        gr.Textbox(label="Translated Text")
    ],
    title="Image Language Translator",
    description="Capture or upload an image, select the target translation language, and get the detected language, detected text, and its translation."
)

iface.launch()

This code creates an image-based language translation application using Gradio and a Roboflow Workflow powered by the Llama 3.2 Vision model. The user can upload or capture an image containing text and select a target language for translation. The image is sent to the Roboflow API with a prompt instructing the model to detect the language, extract the text, and translate it into the selected target language, returning the result in JSON format. The JSON is parsed to retrieve the detected language, original text, and translated text, which are then displayed in the Gradio interface. The application allows real-time language detection and translation from images, making it useful for tasks like document translation and multilingual text recognition. You will see output similar to the following:

Output

Conclusion

Foundation models enable a wide range of tasks across text, image, video, and multimodal data. Their ability to learn from vast datasets and adapt to various domains makes them highly efficient for real-world applications.

In this blog, we explored how foundation models like GPT-4o, Google Gemini, and Llama 3.2 Vision can be used for object detection, video search, real-time surveillance, and language translation tasks.

Curious to learn more about foundation models and try some out? Check out the Roboflow Models directory. In this directory, you'll find 100+ computer vision models, including over a dozen foundation models. You can also play around in Roboflow Workflows with the multimodal model blocks – GPT, Gemini, Florence-2 – to see what you can build!