Paint is an online game that uses AI to score user-submitted digital drawings of zany prompts like "Draw a giraffe in the arctic" or "Draw a bumblebee that loves capitalism." It's Cards Against Humanity meets Microsoft Paint.

A submission of a raccoon using a computer.
The quality and quantity of submissions really impressed us. Paint became an internet sensation. In the week of its launch, players submitted over 100,000 drawings, peaking at nearly 3 submissions per second. It made it to the front page of Hacker News and r/InternetIsBeautiful.

We (Roboflow) built Paint with our friends at Booste to experiment with OpenAI's CLIP model. We challenged ourselves to build something in a weekend shortly after OpenAI open sourced CLIP.

This post walks through how we used AI to build a popular internet game.

If you prefer video content over text, subscribe to our YouTube channel.

Paint at a Glance: An AI Sandwich with Humans in the Middle

At a high level, Paint works as follows:

  1. A user sees a silly prompt to draw
  2. The user draws that prompt
  3. Paint scores that user's drawing and assigns a ranking

The prompts are deliberately silly. Users are prompted to draw "a shark in a barrel," "the world's fastest frog," "a superhero who rides a bike," and so much more.

How'd we come up with so many creative prompts? We didn't. We came up with an initial set and used AI to generate the rest. We wrote the first ten prompts and passed them to GPT-2, a generative text model, to receive thousands more. GPT-2 came up with prompts we certainly wouldn't have considered. The GPT-2-generated prompts did require a human once-over to ensure the game encouraged fun.

The user is then prompted to draw the prompt. To handle drawing in the browser, we made use of Literally Canvas, which gives Microsoft Paint-like functionality.

Literally Canvas provides a user with paint-like functionality in the browser.

Once a user submits their drawing, Paint scores it. This is where we use CLIP. CLIP's magic is that it is the largest model (to date) trained on image-text pairs, which means it does a really good job of mapping text to images. In our case, we want to map (1) the text prompt to the image representation CLIP expects for that prompt and (2) the user-submitted drawing to the image representations CLIP knows. The closer (2) is to (1), the higher the user's score.

Before proceeding, let's break CLIP down a bit further. OpenAI trained CLIP on 400 million image and text pairs. In doing so, OpenAI taught a model the embeddings for what image features and text features "go together." Those embeddings are feature vectors (arrays) that are fairly meaningless to humans but enable numeric representation of images and text – and numeric representation means we can perform mathematical operations like cosine similarity. One can provide an arbitrary snippet of text to CLIP and receive back an array for the image embeddings that are best aligned to that text. Similarly, one can provide an image to CLIP and learn how CLIP would embed that image in its feature space. Knowing the embeddings and features for a broad range of images enables one to, for example, identify the most visually similar images.

Image 1 and Image 2 each have coordinates (X1, X2). We can use cosine similarity to see how close together these points are in space, and the same applies to coordinates (X1, X2, X3, ..., Xn) in high-dimensional spaces. (Image Adapted From Source)
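To make the idea concrete, here is a minimal sketch of cosine similarity in plain Python. The 2-D vectors are made-up stand-ins for embeddings (real CLIP embeddings have hundreds of dimensions), and the function name is ours, not part of any library.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-D "embeddings", invented purely for illustration.
image_1 = [1.0, 0.0]
image_2 = [1.0, 1.0]
image_3 = [0.0, 1.0]

sim_12 = cosine_similarity(image_1, image_2)  # vectors 45 degrees apart
sim_13 = cosine_similarity(image_1, image_3)  # vectors 90 degrees apart

print(f"image_1 vs image_2: {sim_12:.4f}")
print(f"image_1 vs image_3: {sim_13:.4f}")
```

A value near 1 means the two vectors point in almost the same direction, while a value near 0 means they are unrelated. The same function works unchanged for vectors of any length.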

Again, in our use case, we ask CLIP, "What image embedding matches most closely to X arbitrary text?" where X is the prompt. We then ask CLIP, "What embeddings would you assign to Y arbitrary image?" where Y is the user-submitted drawing. The cosine similarity between the two embeddings is the score: the more similar the embeddings (that is, the smaller the distance between them), the higher the drawing ranks.

Paint is an AI sandwich: AI generates the prompt, a human draws that prompt, and then AI scores the result.
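The scoring and ranking step might look roughly like the sketch below. It assumes the prompt and each drawing have already been turned into embeddings by CLIP; the player names, the 3-D vectors, and the `rank_submissions` helper are all hypothetical and only illustrate the ordering logic, not our production code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_submissions(
    prompt_embedding: Sequence[float],
    drawing_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each drawing against the prompt and sort best-first."""
    scored = [
        (player, cosine_similarity(prompt_embedding, embedding))
        for player, embedding in drawing_embeddings.items()
    ]
    # Highest similarity first: the closest drawing takes rank #1.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up embeddings standing in for CLIP output.
prompt_embedding = [0.9, 0.1, 0.4]
drawings = {
    "alice": [0.8, 0.2, 0.5],
    "bob": [0.1, 0.9, 0.0],
    "carol": [0.9, 0.1, 0.3],
}

leaderboard = rank_submissions(prompt_embedding, drawings)
for rank, (player, score) in enumerate(leaderboard, start=1):
    print(f"#{rank} {player}: {score:.4f}")
```

Sorting on similarity in descending order is equivalent to sorting on cosine distance (1 minus similarity) in ascending order, which is why "lowest distance" and "highest score" describe the same leaderboard.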

CLIP: Discoveries, Issues, and Resolutions

In running what we believe to be the largest game with CLIP (yielding a dataset of 150,000+ images), we encountered a number of fascinating issues and discoveries along the way.

CLIP Understands Illustrations

For Paint to work at all, CLIP would have to generalize not only to any arbitrary image, but to any arbitrary digitally drawn image. All user submissions were created via a Microsoft Paint-like interface. We truly didn't know whether CLIP could reliably understand what such images contained.

Surprisingly, CLIP does remarkably well – even on relatively sparse user-submitted images.

CLIP understands arbitrary concepts and sparse illustrations – like this relatively primitive, "Man who eats a lot of fish" submission, which ranks #1 for this prompt.

CLIP Can Read

One of the very first discoveries our users exposed to us is that CLIP has an understanding of handwriting. This means that a user could simply write the prompt as text on the canvas and rank relatively highly on the leaderboard.

This so-called typography attack has now been well-documented on CLIP.

Typography can confuse CLIP. (Source)

We discovered this and built a solution before the now-famous "Apple iPod" example went viral. We apply a text penalty to any image that appears to have handwriting in it. How do we know if an image has handwriting? You guessed it – we ask CLIP. Specifically, we calculate the embedding for, "An image that contains handwriting." If the user-submitted drawing ranks more closely to the embedding for handwriting than it does to the target prompt, we penalize the submission's score by a factor. (This strategy requires careful tuning: a number of well-drawn submissions also incorporate text, but in a benign way. As an example, check the "World's Most Fabulous Monster" image at the top of this post – the monster is on the cover of Vogue magazine.)
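A minimal sketch of such a text penalty follows. It assumes all three embeddings (drawing, prompt, and the handwriting phrase) already come from CLIP. The `penalty_factor` value of 0.5 and the `penalized_score` helper are hypothetical choices for illustration; the actual factor we tuned is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def penalized_score(
    drawing: Sequence[float],
    prompt: Sequence[float],
    handwriting: Sequence[float],
    penalty_factor: float = 0.5,  # hypothetical value; would need tuning
) -> float:
    """Score a drawing, shrinking the score if it looks more like handwriting."""
    prompt_sim = cosine_similarity(drawing, prompt)
    handwriting_sim = cosine_similarity(drawing, handwriting)
    if handwriting_sim > prompt_sim:
        # min() ensures the penalty never raises a negative score.
        return min(prompt_sim, prompt_sim * penalty_factor)
    return prompt_sim


# Made-up 2-D embeddings standing in for CLIP output.
prompt = [1.0, 0.0]
handwriting = [0.0, 1.0]  # stands in for "An image that contains handwriting."
clean_drawing = [0.9, 0.1]  # mostly resembles the prompt
text_drawing = [0.2, 0.9]   # mostly resembles handwriting

clean_score = penalized_score(clean_drawing, prompt, handwriting)
text_score = penalized_score(text_drawing, prompt, handwriting)
print(f"clean drawing: {clean_score:.4f}")
print(f"text drawing:  {text_score:.4f}")
```

Because the penalty only triggers when handwriting is the closer match, a well-drawn submission that happens to include some text can still score at full value.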

CLIP Can Moderate Content

We built a tool that allows anyone on the internet to make a drawing. As you'd expect, we did not always receive the cleanest of imagery.

To moderate submissions, we check whether a given drawing is more similar to a target NSFW topic than it is to the actual prompt's embedding. If the user's drawing is more similar to, say, "A drawing of nudity" than "A drawing of an alien that can teleport," we conclude it likely shouldn't appear on the leaderboard and censor it automatically.

An example of an automatically detected NSFW image.
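The moderation check can be sketched the same way. This example assumes a small set of blocked-topic embeddings has been precomputed with CLIP; the topic labels, the toy vectors, and the `is_flagged` helper are hypothetical and only show the comparison being made.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_flagged(
    drawing: Sequence[float],
    prompt: Sequence[float],
    blocked_topics: Dict[str, Sequence[float]],
) -> bool:
    """Flag a drawing that is closer to any blocked topic than to its prompt."""
    prompt_sim = cosine_similarity(drawing, prompt)
    return any(
        cosine_similarity(drawing, topic_embedding) > prompt_sim
        for topic_embedding in blocked_topics.values()
    )


# Made-up 3-D embeddings standing in for CLIP output.
prompt = [1.0, 0.0, 0.0]  # e.g. "A drawing of an alien that can teleport"
blocked_topics = {
    "blocked topic A": [0.0, 1.0, 0.0],
    "blocked topic B": [0.0, 0.0, 1.0],
}
on_topic_drawing = [0.9, 0.2, 0.1]
off_topic_drawing = [0.1, 0.2, 0.9]

print(is_flagged(on_topic_drawing, prompt, blocked_topics))
print(is_flagged(off_topic_drawing, prompt, blocked_topics))
```

A flagged drawing would then be hidden from the leaderboard rather than scored. As with the text penalty, a relative comparison like this avoids choosing a single absolute similarity threshold.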

Verdict: Play Paint and Play with CLIP

Our experience playing with CLIP allowed us to scratch the surface of what is possible and exposed blind spots in CLIP's understanding. By trying Paint, you, too, can get a sense of its power.

Try playing Paint as an icebreaker in your next stand-up. Good luck!