Update: many of the below findings are possible from users across the web. A special note to Simon Willison, whom published “Multi-modal prompt injection image attacks against GPT-4V” on the topic. If you find additional examples you'd also like included here, please let us know.
Prompt injection is a vulnerability in which attackers can inject malicious data into a text prompt, usually to execute a command or extract data. This compromises the system's security, allowing unauthorized actions to be performed.
Some time ago we showed you how to use prompt injection to jailbreak OpenAI’s Code Interpreter, allowing you to install unauthorized Python packages and run Computer Vision models in a seemingly closed environment. In this blog, we will show you what Vision Prompt Injection is, how it can be used to steal your data, and how to defend against it.
GPT-4 and Vision Prompt Injection
On September 25, 2023, OpenAI announced the launch of a new feature that expands how people interact with its latest and most advanced model, GPT-4V(ision): the ability to ask questions about images.
Among other things, GPT-4 is now able to read the text found in uploaded images. At the same time, this update opened a new vector of attack on Large Language Models (LLMs) . Instead of putting a malicious phrase in a text prompt, it can be injected through an image.
In the uploaded image, there is text with added instructions. Much like in conventional prompt injection scenarios, the model ignores the user's directives and acts on the instructions embedded in the image.
To make matters worse, the text on the image does not have to be visible. One way to hide text is to render it in a color almost identical to the background. This makes the text invisible to the human eye, but possible to extract with the right software. It turns out that GPT-4 is so good at Optical Character Recognition (OCR) that it makes it vulnerable to this form of attack.
But what if someone decides to exploit this vulnerability to launch an attack on an LLM-based system?
ChatGPT doesn't give you too many ways to communicate with the outside world. However, it allows you to generate clickable links, which are usually used to retrieve processing results. Johann Rehberger found a way to take advantage of this functionality and arm his Visual Prompt Injection exploit.
The above instructions cause the chat history to be included in the URL and rendered as an image in Markdown. This way you don't even have to click the link, the HTTP request is sent automatically. The server needs only to parse it back.
How to Defend Yourself
It is only a matter of time before many of us start building applications using these types of models. They could be used, for example, to automatically process resumes submitted by candidates.
Defending against jailbreaks is difficult. This is because it requires teaching the model how to distinguish between good and bad instructions. Unfortunately, almost all methods that increase the security of LLM, at the same time lead to reduced usability of the model.
Vision Prompt Injection is a brand new problem. The situation is made even more difficult by the fact that GPT-4 Vision is not open-source and we don't quite know how text and vision input affect each other. I tried techniques based on adding additional instructions in the text part and ordering the LLM to ignore potential instructions contained in the image. It seems to improve the model's behavior, at least to some extent.
The only thing we can do at the moment is to make sure we are aware of this problem and take it into account every time we design LLM-based products. Both OpenAI and Microsoft are actively researching to protect LLMs from jailbreaks.
Did you find more vision prompt injections? Share them on Twitter and tag @roboflow!