
Large Language Models (LLMs) such as GPT-4o, Google Gemini, and Anthropic Claude have evolved beyond simple text processing. These models are now multimodal, capable of understanding and integrating various types of input, including images and video, alongside text. This capability allows them to tackle a wide range of computer vision problems, including object detection, text recognition, and visual question answering.
This blog will show you how to structure and use prompts effectively to solve computer vision tasks with Large Multimodal Models (LMMs).
Capabilities of Large Multimodal Models in Computer Vision
- Detecting objects and their positions in images
- Extracting text from images using Optical Character Recognition (OCR)
- Answering questions about visual content through Visual Question Answering (VQA)
- Understanding scenes and analyzing contextual information within images
- Generating descriptive captions or summaries for images
- Integrating text and visual information for complex multimodal reasoning
Large Multimodal Model: Google Gemini
In this blog, we will use Google Gemini as the LLM of choice to demonstrate prompting for solving computer vision problems.
Google Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind, designed specifically for multimodality. It includes variants such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Gemini 2.5 Flash-Lite, each optimized for different use cases.
Gemini models are free to use but require an API key, which you can easily obtain at no cost from Google AI Studio.
Accessing and Integrating Google Gemini via API
The script below demonstrates how to call a Google Gemini model through the API on an example image:
import requests
import base64
import json


def gemini_inference(image, prompt, model_id, temperature, api_key):
    """
    Perform Gemini inference using Google Gemini API with text + image input.
    """
    API_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{model_id}:generateContent?key={api_key}"

    payload = {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {"inlineData": {"mimeType": "image/jpeg", "data": image}},
                ],
            }
        ],
        "generationConfig": {"temperature": temperature}
    }

    headers = {"Content-Type": "application/json"}
    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    response.raise_for_status()

    result = response.json()
    return result["candidates"][0]["content"]["parts"][0]["text"]


if __name__ == "__main__":
    # Encode image to base64 in one line
    base64_image = base64.b64encode(open("test.jpg", "rb").read()).decode("utf-8")

    # Run inference
    output = gemini_inference(
        base64_image,
        prompt="Count the number of objects in the image.",
        model_id="gemini-2.5-flash",  # Gemini variant
        temperature=0.0,
        api_key="YOUR_GEMINI_API_KEY"
    )

    print(output)
The example test image used in the above script is:

The output produced by the above script, using the prompt “Count the number of objects in the image.”, would look like this:

Leveraging Google Gemini within Roboflow Workflows
Google Gemini is available as a built-in block in Roboflow Workflows. Roboflow Workflows is a web-based no-code platform for chaining computer vision tasks such as object detection, dynamic cropping, and bounding box visualization, with each task represented as an individual block.
To get started, create a free Roboflow account and log in. Next, create a workspace, then click on “Workflows” in the left sidebar and click on Create Workflow.
You’ll be taken to a blank workflow editor, ready for you to build your AI-powered workflow. Here, you’ll see two workflow blocks: Inputs and Outputs, as shown below:

To add Google Gemini to your workflow, click ‘Add a Model’, search for ‘Google Gemini’, select it, and click ‘Add Model’ to insert the block into your workflow, as shown below:

Now update the ‘API Key’ with your own in the Configure tab, accessible by clicking the ‘Google Gemini’ block in the workflow.
Alongside the 'API Key', the configuration tab includes parameters such as temperature, model version, and max tokens, all of which can be easily adjusted through the Roboflow Workflow UI.
You can now run the workflow as shown below:
Prompting Tips for Multimodal Models to Perform Computer Vision Tasks
Below are practical prompting strategies that help you improve accuracy and reduce hallucinations:
Provide Contextual Specificity
Context helps the LLM capture user intent and generate more accurate outputs. Providing detailed situational or background information allows the model to produce more relevant and precise predictions.
The difference is illustrated below, where two different prompts are applied to the same image:

Non-Contextual Prompt
The prompt below provides no details about the context in which the description should be generated.
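For illustration, a non-contextual prompt here could be as simple as: “Give me a description of the image.”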
Non-Contextual Prompt Output
This output, based on the above Non-Contextual Prompt, provides a general description of the image without emphasizing any particular purpose or audience. It lists objects and details literally but does not highlight features that would make the image appealing for a specific context, such as a real estate catalog.

Contextual Prompt
The prompt below specifies the context for the description, guiding the model to focus on key features such as space, lighting, style, and comfort for a real estate catalog.
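For illustration, a contextual prompt might read: “Give me a description of the image for a real estate catalog. Highlight the space, lighting, style, and comfort of the room to appeal to potential buyers.”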
Contextual Prompt Output
This output, based on the above Contextual Prompt, is tailored to the context provided. It emphasizes features such as space, lighting, style, and comfort, using persuasive and engaging language suitable for a real estate catalog, rather than merely listing objects in the image.

System Instructions
Rather than including all contextual information within a single prompt, you can use System Instructions to provide additional context alongside your main prompt.
By separating guidance into its own instruction, system instructions simplify the main prompt while still steering the outputs effectively. For example, the Gemini inference payload below demonstrates this:
payload = {
    "systemInstruction": {
        "role": "system",
        "parts": [
            {"text": "You are a professional real estate copywriter. Always describe spaces in a way that highlights their value, comfort, and uniqueness. Use engaging, persuasive, yet natural language that helps potential buyers envision themselves in the home"}
        ]
    },
    "contents": [
        {
            "role": "user",
            "parts": [
                {"text": "Give me a description of the image."},
                {"inlineData": {"mimeType": "image/jpeg", "data": image}},
            ],
        }
    ],
    "generationConfig": {"temperature": temperature}
}
Leverage Examples
Including examples in your prompt shows the LLM what a correct response should look like. Few-shot prompts provide several examples, whereas zero-shot prompts provide none. Well-chosen examples help guide the model’s formatting, phrasing, and scope, enhancing both the accuracy and quality of its outputs.
The difference is illustrated below, where two different prompts are applied to the same image:

Zero-Shot Prompt
The prompt below provides no examples for the model to follow, relying entirely on its general understanding of image description:
Zero-Shot Prompt Output
This output, based on the above Zero-Shot Prompt, provides a general description of the scene without structured guidance. It captures the overall setting and mood but may lack specific stylistic cues or formatting.

Few-Shot Prompt
The prompt below provides specific examples to guide the model. It demonstrates how to structure descriptions, including setting, lighting, camera perspective, and mood:
Example 1 – Rainy Alley
Setting: Narrow city alley at night.
Lighting: Flickering neon signs, wet reflections.
Camera: Low-angle tracking shot.
Mood: Tense, mysterious.

Example 2 – Morning Beach
Setting: Empty beach at sunrise.
Lighting: Soft, warm sunlight; gentle highlights on waves.
Camera: Wide shot, slow pan along shoreline.
Mood: Calm, serene.
Few-Shot Prompt Output
This output, based on the above Few-Shot Prompt, is more structured and detailed. The few-shot prompt conveys structure and stylistic cues in the output without needing to state them explicitly in the prompt.

Best Practices for Leveraging Prompts with Examples
- Use examples that show the model the behavior you want. This is generally clearer and leads to better outputs.
- Avoid examples that show the model what not to do; these can be confusing and sometimes counterproductive.
- Keep examples in few-shot prompts uniform, especially regarding XML tags, whitespace, newlines, and separators.
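For a minimal sketch of how this looks in practice (reusing the gemini_inference function and base64_image variable from the first script; the prompt wording here is illustrative), the few-shot examples can be embedded directly in the text part of the request:

few_shot_prompt = """Describe the attached image in the same format as these examples.

Example 1 – Rainy Alley
Setting: Narrow city alley at night.
Lighting: Flickering neon signs, wet reflections.
Camera: Low-angle tracking shot.
Mood: Tense, mysterious.

Example 2 – Morning Beach
Setting: Empty beach at sunrise.
Lighting: Soft, warm sunlight; gentle highlights on waves.
Camera: Wide shot, slow pan along shoreline.
Mood: Calm, serene.
"""

# Run inference with the few-shot prompt
output = gemini_inference(
    base64_image,
    prompt=few_shot_prompt,
    model_id="gemini-2.5-flash",
    temperature=0.0,
    api_key="YOUR_GEMINI_API_KEY"
)
print(output)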
Utilize Structured Output Generation
Most LLMs support Function Calling, which guarantees consistent JSON responses with all required keys. It enforces a structured format, preserves data types, and allows descriptive prompts for each key to capture detailed reasoning, ensuring reliable extraction of insights from images.
The script below demonstrates Function Calling in Gemini, with most of the prompting occurring within function_declarations.
import requests
import base64
import json


def gemini_inference(image, prompt, model_id, temperature, api_key, function_declarations):
    """
    Perform Gemini inference using Google Gemini API with text + image input.
    Uses function calling with a provided schema.
    """
    API_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{model_id}:generateContent?key={api_key}"

    payload = {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {"inlineData": {"mimeType": "image/jpeg", "data": image}},
                ],
            }
        ],
        "generationConfig": {"temperature": temperature},
        "tools": [{"functionDeclarations": function_declarations}],
    }

    headers = {"Content-Type": "application/json"}
    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    response.raise_for_status()

    result = response.json()
    fn_call = result["candidates"][0]["content"]["parts"][0]["functionCall"]
    return fn_call["args"]


if __name__ == "__main__":
    # Encode image to base64
    base64_image = base64.b64encode(open("test.jpg", "rb").read()).decode("utf-8")

    # Define function schema with descriptions
    function_declarations = [
        {
            "name": "list_objects",
            "description": "Identify objects present in the given image.",
            "parameters": {
                "type": "object",
                "properties": {
                    "objects": {
                        "type": "array",
                        "description": "A list of objects detected in the image.",
                        "items": {
                            "type": "object",
                            "properties": {
                                "label": {
                                    "type": "string",
                                    "description": "The name of the detected object."
                                }
                            },
                            "required": ["label"],
                            "description": "Details about a single object detected in the image."
                        }
                    }
                },
                "required": ["objects"],
                "description": "The structured output containing all detected objects."
            }
        }
    ]

    # Run inference
    output = gemini_inference(
        base64_image,
        prompt="Identify the objects in the image.",
        model_id="gemini-2.5-flash",
        temperature=0.0,
        api_key="YOUR_GEMINI_API_KEY",
        function_declarations=function_declarations,
    )

    print(json.dumps(output, indent=2))
The script output for the prompt “Identify the objects in the image” in the previously provided image of stationery is:
{
  "objects": [
    {
      "label": "headphones"
    },
    {
      "label": "keyboard"
    },
    {
      "label": "mouse"
    },
    {
      "label": "pen"
    },
    {
      "label": "notebook"
    }
  ]
}
You can use the Function Calling capabilities in Roboflow Workflows with the Google Gemini block without writing any code. Set the Task Type to 'Structured Output Generation' and provide the output structure below, which uses the same function declarations from the earlier script as a raw string.
{
"output_schema": "{\"name\":\"list_objects\",\"description\":\"Identify objects present in the given image.\",\"parameters\":{\"type\":\"object\",\"properties\":{\"objects\":{\"type\":\"array\",\"description\":\"A list of objects detected in the image.\",\"items\":{\"type\":\"object\",\"properties\":{\"label\":{\"type\":\"string\",\"description\":\"The name of the detected object.\"}},\"required\":[\"label\"],\"description\":\"Details about a single object detected in the image.\"}}},\"required\":[\"objects\"],\"description\":\"The structured output containing all detected objects.\"}}"
}
The workflow video below demonstrates this:
Proper Object Detection Technique
LMMs can perform zero-shot object detection, identifying one or multiple classes simultaneously without any fine-tuning, using only prompts. They can even reason about whether an object should be detected.
When performing object detection with Gemini, always include the instruction: “Return just box_2d, label and no additional text.” in your prompt. For best results, avoid using Gemini’s Function Calling capabilities for object detection and instead rely on its text generation capabilities.
Additionally, Gemini returns bounding box coordinates normalized to a 0-1000 range, so they must be scaled back to the image's pixel dimensions before use. The script below demonstrates this using the Gemini model:
# Ensure OpenCV is installed before running this script:
# You can install it via: pip install opencv-python-headless
import requests
import json
import base64
import cv2


def clean(results):
    """
    Clean the raw JSON output from the Gemini API.
    """
    return results.strip().removeprefix("```json").removesuffix("```").strip()


def gemini_inference(image, prompt, model_id, temperature, api_key):
    """
    Perform object detection using Google Gemini API.
    """
    # Fixed instruction; the plotting code below depends on this output format.
    bounding_box_prompt = "Return just box_2d, label and no additional text."

    API_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{model_id}:generateContent?key={api_key}"

    payload = {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt + " " + bounding_box_prompt},
                    {"inlineData": {"mimeType": "image/jpeg", "data": image}},
                ],
            }
        ],
        "generationConfig": {"temperature": temperature}
    }

    headers = {"Content-Type": "application/json"}
    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    response.raise_for_status()

    result = response.json()
    return result["candidates"][0]["content"]["parts"][0]["text"]


if __name__ == "__main__":
    # Encode image to base64
    image = "test.jpg"
    with open(image, "rb") as f:
        base64_image = base64.b64encode(f.read()).decode("utf-8")

    # Run inference
    output = gemini_inference(
        base64_image,
        prompt="Detect only the analog objects in the image with correct labels.",
        model_id="gemini-2.5-flash",
        temperature=0.0,
        api_key="YOUR_GEMINI_API_KEY"
    )

    boxes = json.loads(clean(output))
    print(boxes)

    # Load image with OpenCV
    image_cv2 = cv2.imread(image)
    h, w, _ = image_cv2.shape

    for item in boxes:
        # Gemini returns [y1, x1, y2, x2] normalized to 0-1000; scale to pixel coordinates
        y1, x1, y2, x2 = item["box_2d"]
        y1 = int(y1 / 1000 * h)
        x1 = int(x1 / 1000 * w)
        y2 = int(y2 / 1000 * h)
        x2 = int(x2 / 1000 * w)

        # Draw green rectangle
        rectangle_thickness = 16
        cv2.rectangle(image_cv2, (x1, y1), (x2, y2), (0, 255, 0), thickness=rectangle_thickness)

        # Draw the label text just above the box
        text = item["label"]
        font_scale = 3
        font_thickness = 6
        cv2.putText(image_cv2, text, (x1, max(0, y1 - 4)), cv2.FONT_HERSHEY_SIMPLEX, font_scale, (255, 0, 0), font_thickness)

    # Save the annotated image
    cv2.imwrite("annotated_image.jpg", image_cv2)
You can also integrate the above script into Roboflow Workflows using Custom Python code blocks. This example demonstrates how to use the script within a Roboflow workflow for object detection.
The script output for the prompt “Detect only the analog objects in the image with correct labels” is:

Break Complex Prompts into Smaller Tasks
When an image task is too complex for a single inference, you can break it down into smaller steps to make it more manageable.
For example, to perform an OCR task on the image below:

Given the prompt, “List all the contents of the shelf labels in the image,” the output below was produced:

In the above output, several errors and hallucinations can be observed in the product names: the OCR produced inaccurate milliliter values, and some words in the full product names were omitted.
To address this, rather than writing one lengthy prompt that details every instruction and exception (which can lead to hallucinations and inconsistent results across images), it is more effective to first run object detection on the shelf labels to obtain their bounding boxes, and then use those bounding boxes to crop out the individual shelf labels, as shown below:

The cropped images can then be pre-processed to enhance their quality and passed individually to an LMM for OCR. This approach greatly improves OCR accuracy and overall reliability. An example of this technique is demonstrated in this blog.
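Below is a minimal sketch of this two-step approach. It assumes the plain-text gemini_inference function from the first script and the clean helper from the object detection script; the file name, prompts, and pre-processing step are illustrative:

import base64
import json
import cv2

API_KEY = "YOUR_GEMINI_API_KEY"
image_path = "shelf.jpg"  # illustrative file name

with open(image_path, "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

# Step 1: detect the shelf labels and get their bounding boxes
detection_output = gemini_inference(
    base64_image,
    prompt="Detect all shelf labels in the image. Return just box_2d, label and no additional text.",
    model_id="gemini-2.5-flash",
    temperature=0.0,
    api_key=API_KEY
)
boxes = json.loads(clean(detection_output))

# Step 2: crop each shelf label, optionally enhance it, and run OCR on each crop individually
image_cv2 = cv2.imread(image_path)
h, w, _ = image_cv2.shape

for i, item in enumerate(boxes):
    y1, x1, y2, x2 = item["box_2d"]  # normalized to 0-1000
    crop = image_cv2[int(y1 / 1000 * h):int(y2 / 1000 * h), int(x1 / 1000 * w):int(x2 / 1000 * w)]

    # Simple pre-processing (illustrative): upscale the crop to help OCR
    crop = cv2.resize(crop, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

    _, buffer = cv2.imencode(".jpg", crop)
    crop_b64 = base64.b64encode(buffer.tobytes()).decode("utf-8")

    text = gemini_inference(
        crop_b64,
        prompt="Read the full product name and price printed on this shelf label.",
        model_id="gemini-2.5-flash",
        temperature=0.0,
        api_key=API_KEY
    )
    print(f"Label {i + 1}: {text.strip()}")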
Leverage Grounding Search to Reduce Hallucinations
Grounding search in Google Gemini improves computer vision by linking what the model sees in an image to real-world knowledge from Google Search, enabling more accurate, up-to-date, and reliable recognition while reducing hallucinations.
For example, consider the image below:

In this image, we can use Gemini inference to detect the phone's brand and apply grounding search to retrieve the latest price for that brand.
The prompt we’ll use is: “Identify the image and get me real-time prices from the web with sources.” The code below demonstrates how to use this prompt with grounding search for Gemini inference.
import requests
import base64
import json


def gemini_inference(image, prompt, model_id, temperature, api_key):
    """
    Perform Gemini inference using Google Gemini API with text + image input,
    and apply grounding search (Google Search Retrieval).
    """
    API_URL = f"https://generativelanguage.googleapis.com/v1beta/models/{model_id}:generateContent?key={api_key}"

    payload = {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {"inlineData": {"mimeType": "image/jpeg", "data": image}},
                ],
            }
        ],
        "generationConfig": {"temperature": temperature},
        "tools": [
            {
                "google_search": {}  # Enable grounding search
            }
        ]
    }

    headers = {"Content-Type": "application/json"}
    response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
    response.raise_for_status()

    result = response.json()
    return result["candidates"][0]["content"]["parts"][0]["text"]


if __name__ == "__main__":
    # Encode image to base64
    base64_image = base64.b64encode(open("phone.webp", "rb").read()).decode("utf-8")

    # Run inference with grounding search
    output = gemini_inference(
        base64_image,
        prompt="Identify the image and get me real-time prices from the web with sources.",
        model_id="gemini-2.5-flash",  # Gemini variant
        temperature=1,
        api_key="YOUR_GEMINI_API_KEY"
    )

    print(output)
The output of the above script will be similar to the following:

Experiment with Parameters
LMMs like Google Gemini provide various parameters that can be used alongside a prompt to fine-tune the outputs. These parameters include:
- Thinking Budget: The thinking budget sets how many tokens the model can use for reasoning. Higher values enable more detailed responses, while lower values reduce latency. Setting it to 0 disables thinking, and -1 enables dynamic adjustment based on request complexity.
- Temperature: Temperature controls the randomness of token selection during sampling. Lower values make responses more deterministic, while higher values increase diversity and creativity. A temperature of 0 always selects the highest-probability response.
- Top P: topP selects tokens whose cumulative probability reaches the topP value. For example, with topP=0.5, only tokens contributing to 50% probability are considered, and temperature determines the final choice.
- Top K: topK limits token selection to the K most probable options. For example, topK=1 selects the single most likely token (greedy decoding), while topK=3 samples from the top 3 tokens, with temperature and topP applied for final selection, as illustrated in the sketch after this list.
- Stop Sequences: Defines character sequences that make the model stop generating. Avoid using sequences likely to appear in the output.
- Max Output Tokens: Specifies the maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.
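As a rough illustration of how topK and topP narrow down the candidate tokens (a simplified sketch, not Gemini's exact sampling implementation), consider a hypothetical next-token distribution:

# Hypothetical next-token probabilities (illustrative only)
probs = {"cat": 0.50, "dog": 0.30, "bird": 0.15, "fish": 0.05}

def filter_top_k(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def filter_top_p(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return kept

print(filter_top_k(probs, 3))    # {'cat': 0.5, 'dog': 0.3, 'bird': 0.15}
print(filter_top_p(probs, 0.5))  # {'cat': 0.5}
# Temperature then controls how randomly the final token is sampled from the remaining candidates.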
You can find the default parameter values for all Gemini models here under ‘Parameter defaults’.
Below is the generationConfig dictionary, which shows how the above parameters can be used with Gemini Inference inside the payload:
"generationConfig": {
"temperature": 0,
"topK": 5,
"topP": 0.9,
"maxOutputTokens": 8000,
"stopSequences": ["<THE END>"], # The LLM stops generating when this sequence appears.
"thinkingConfig": {
"thinkingBudget": -1 # Dynamic thinking enabled
}
}
Prompting Tips for LLMs with Vision Capabilities Conclusion
Large Multimodal Models like Google Gemini have transformed computer vision by integrating text and image understanding. By carefully structuring prompts, leveraging contextual cues and examples, and using structured outputs or grounding search, you can significantly improve accuracy in tasks such as object detection, OCR, and visual question answering.
Additionally, Roboflow Workflows provides a no-code interface with a variety of multimodal model blocks, including closed-source options like Google Gemini and GPT-4, as well as open-source models such as CogVLM, LLaVA-1.5, and PHI-4, making it easy to quickly develop computer vision applications.
To learn more about building with Roboflow Workflows, check out the Workflows launch guide.
Written by Dikshant Shah
Cite this Post
Use the following entry to cite this post in your research:
Contributing Writer. (Sep 8, 2025). Prompting Tips for Large Language Models with Vision Capabilities. Roboflow Blog: https://blog.roboflow.com/prompting-tips-for-large-language-models-with-vision/