How to Use a VLM to Control a PC
Published May 11, 2026 • 4 min read

Using a VLM to control a PC means handing a model the same thing a person works from (a picture of the screen) and letting it decide what to click, type, or open next. No API, no integration, no brittle scripts tied to exact pixel positions. The model looks at the interface, finds the button you asked for, and acts on it. This is often called computer use, and it has moved from a research demo to something you can run on your own machine.

In a recent Roboflow webinar, machine learning engineer Matvei Popov shows exactly this. He runs the Qwen 3.5 vision language model locally, then hands it control of his screen and watches it click through the Roboflow interface to start a model training job, hands off the keyboard.

This post explains what using a VLM to control a PC actually involves, why it is useful, why models are suddenly good at it, and then points you to the webinar where you can see it happen.

What Using a VLM to Control a PC Means

A vision language model, or VLM, is a model that takes in an image and text together and responds in text. When you point one at PC control, the loop is straightforward to describe. The system captures a screenshot, sends it to the model with an instruction like "find and click the train button," and the model returns where to act, usually as screen coordinates. Code then executes that click. Repeat for each following step, taking a fresh screenshot every time, and the model is driving the software.
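To make that loop concrete, here is a minimal sketch in Python. It is not the script from the webinar. The three callables (capture_screen, ask_model, click) are hypothetical placeholders that you would back with a screenshot library, a call to your VLM, and a mouse-automation library of your choice.

```python
from typing import Callable, Iterable, List, Tuple

Point = Tuple[int, int]


def run_steps(
    instructions: Iterable[str],
    capture_screen: Callable[[], bytes],       # hypothetical: returns image bytes
    ask_model: Callable[[bytes, str], Point],  # hypothetical: VLM call returning (x, y)
    click: Callable[[int, int], None],         # hypothetical: performs the mouse click
) -> List[Point]:
    """Run one screenshot -> model -> click cycle per instruction."""
    clicked: List[Point] = []
    for instruction in instructions:
        # Capture a fresh screenshot every step so the model always sees
        # the current state of the interface, not a stale one.
        screenshot = capture_screen()
        x, y = ask_model(screenshot, instruction)
        click(x, y)
        clicked.append((x, y))
    return clicked
```

Passing the three operations in as functions keeps the loop independent of any particular model or automation library, and makes it easy to dry-run with fakes before letting it touch a real mouse.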

The important part is that the model is working from what is on screen, not from a list of functions a developer wired up in advance. That is what separates computer use from traditional automation. A normal script breaks the moment a layout shifts or a button moves. A VLM reads the current screen each time, the same way a new employee would, so it can operate software that was never built to be automated.

Why Is Using a VLM to Control a PC Useful?

Most automation assumes there is an API to call. A great deal of real work does not have one. Legacy desktop tools, internal dashboards, vendor portals, and one-off interfaces all expect a human with a mouse. Using a VLM to control a PC reaches all of that, because anything a person can see and click becomes something the model can see and click.

That opens up a few concrete uses. You can automate repetitive interface work that no one wants to do by hand. You can build agents that complete a task end to end across several applications rather than answering a single question. You can drive testing and QA by having a model exercise a UI like a user. And the same skill that grounds an instruction to a point on a screen extends to grounding actions in the physical world, which is why this matters well beyond the desktop.

Roboflow's guide to building vision language pipelines with VLMs goes deeper on wiring VLMs into real workflows, and computer use with newer OpenAI VLMs covers the broader pattern.

Why this Suddenly Works

Controlling a PC by sight asks a model to do several hard things at once: read an arbitrary interface, understand a plain-language goal, and pin its answer to an exact location. Until recently those capabilities lived in separate models that did not coordinate well.

What changed is that the newest VLMs combine vision, language, and coding in a single model, along with tool calling. Qwen 3.5, the model in the webinar, is one of the first from Alibaba to fold all of that into one release rather than bolting a vision encoder onto a separate language model.

As Matvei puts it, co-developing those abilities together "creates a lot of mutual benefits." A single model that can perceive the screen, reason about the request, and output a precise coordinate is what makes reliable computer use possible.

The model family also spans a wide range of sizes, including a very small 0.8B option, which matters for running it close to where the work happens.

Qwen 3.5 Controls a PC, Hands off the Keyboard

Matvei starts simple. He runs Qwen 3.5 with Roboflow Inference and adds the Qwen 3.5 block to a workflow, then asks it to describe an image. The model reads the scene and the details in it, the expected behavior of a strong VLM.

Then he turns it loose on his computer. A short Python script captures his screen, sends each instruction to Qwen 3.5, and converts the coordinate the model returns into a real click. He chains the steps needed to launch a training run (click "train model," click "custom training," click the model, click "start training"), takes his hands off the keyboard, and lets the model work through the interface on its own.
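The coordinate conversion step might look something like the sketch below. It rests on two assumptions that are not from the webinar: that the model's reply is text containing an x and y value as its first two numbers, and that those values are expressed in the size of the image sent to the model, which may differ from the real screen resolution. The function names are hypothetical, and you should check how your model actually formats coordinates.

```python
import re
from typing import Tuple


def parse_point(reply: str) -> Tuple[float, float]:
    """Pull the first two numbers out of a model reply (assumed to be x, y)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reply)
    if len(numbers) < 2:
        raise ValueError(f"no coordinate found in reply: {reply!r}")
    return float(numbers[0]), float(numbers[1])


def scale_to_screen(
    point: Tuple[float, float],
    model_size: Tuple[int, int],   # (width, height) of the image the model saw
    screen_size: Tuple[int, int],  # (width, height) of the real screen
) -> Tuple[int, int]:
    """Map a point from model-image space to screen pixels, clamped to the screen."""
    x, y = point
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    sx = round(x * screen_w / model_w)
    sy = round(y * screen_h / model_h)
    # Clamp so a slightly out-of-range answer cannot click off-screen.
    return min(max(sx, 0), screen_w - 1), min(max(sy, 0), screen_h - 1)
```

For example, if the screenshot was downscaled to 1024x576 before being sent and the real screen is 2048x1152, a reply of "(512, 300)" maps to the screen pixel (1024, 600).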

It starts a real RF-DETR training job without a human touching the mouse. The point is not the specific clicks. It is the proof that a general vision language model, given only screenshots and plain instructions, can operate real software end to end. Seeing it happen live, including the small joke that you could then train Qwen inside Roboflow itself, lands the idea better than any description.

Watch a VLM Control a PC Live

The full session covers the local setup, the model parameters, and the live computer-use demo from start to finish. Watch it on YouTube here.

Then try it yourself. Add the Qwen 3.5 block to a workflow and run it on your own images in Roboflow Workflows, and see the walkthrough of using Qwen 3.5 in Roboflow for the written version.

Cite this Post

Use the following entry to cite this post in your research:

Contributing Writer. (May 11, 2026). How to Use a VLM to Control a PC. Roboflow Blog: https://blog.roboflow.com/use-a-vlm-to-control-pc/


Written by

Contributing Writer