How to Build a Semantic Image Search Engine with Roboflow and CLIP
Published Mar 30, 2023 • 9 min read

Historically, building a robust search engine for images was difficult. One could search by features such as file name and image metadata, and use any context around an image (e.g. alt text, or surrounding text if an image appears in a passage of text) to provide richer search features. This was before the advent of neural networks that can identify images semantically related to a given user query.

OpenAI's Contrastive Language-Image Pre-Training (CLIP) model provides the means through which you can implement a semantic search engine with a few dozen lines of code. The CLIP model has been trained on millions of pairs of text and images, encoding semantics from images and text combined. Using CLIP, you can provide a text query and CLIP will return the images most related to the query.

In this guide, we're going to walk through how to build a semantic search engine on a folder of images using CLIP.

We'll walk through two ways of implementing the search engine:

  1. A short HTTP request that you can make to query a CLIP embedding search API available for all Roboflow Universe datasets and any other dataset associated with your account;
  2. A more in-the-weeds demo of using CLIP yourself to build a semantic search engine.

Without further ado, let's get started!

Introduction to Embeddings

Embeddings are a numeric representation of data such as text and images. Embeddings are calculated using a model such as CLIP, which was trained on pairs of images and text. Through the CLIP training process, the model learned to encode semantics about the contents of images. We can create a search engine with embeddings. To do so, we need to:

  1. Calculate embeddings for all of the images in our dataset;
  2. Calculate a text embedding for a user query (e.g. "hard hat" or "car"); and
  3. Compare the text embedding to the image embeddings to find related embeddings.

The closer two embeddings are, the more similar the documents they represent are.
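
To make "closeness" concrete, here is a minimal sketch that ranks two toy vectors against a query using cosine similarity. The vectors and file names are made up for illustration and were not produced by CLIP; real CLIP ViT-B/32 embeddings have 512 dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype="float32")
    b = np.asarray(b, dtype="float32")
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings" (hypothetical values, not produced by CLIP).
query = [0.9, 0.1, 0.0]
image_embeddings = {
    "zebra.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

# Sort file names from most similar to least similar to the query.
ranked = sorted(
    image_embeddings,
    key=lambda name: cosine_similarity(query, image_embeddings[name]),
    reverse=True,
)
print(ranked)
```

The vector that points in nearly the same direction as the query ranks first. A search engine does the same comparison across every image embedding in the dataset.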

Let's talk through how to query the Roboflow API to run semantic search on a dataset on Roboflow Universe.

Semantic Search with Roboflow

When you upload an image to Roboflow, we calculate an image embedding for the image using CLIP. The Roboflow API provides a /search endpoint through which you can query your dataset and find images related to a query. To use this endpoint, you'll need a free Roboflow account.

To find images related to a text query, we can use this Python code:

import requests

WORKSPACE_ID = "team-roboflow"
DATASET_NAME = "coco-128"
data = {"prompt": "zebra", "limit": 10, "fields": ["id", "name"], "offset": 0}

response = requests.post(
    f"https://api.roboflow.com/{WORKSPACE_ID}/{DATASET_NAME}/search?api_key=API_KEY",
    data=data,
).json()

You'll need to add your API key into the code above. You can learn how to find your key in our API key documentation. You should replace the workspace ID and dataset name with the values associated with the dataset you want to search. You can learn how to find these values in our workspace and dataset ID documentation.

This code will return the 10 images most similar to the search term "zebra" from the "coco-128" dataset on the "team-roboflow" account. Let's run our code and see what happens:

{'offset': 0, 'total': 378, 'results': [{'id': 'pHMJbpwWCwIO4n3cwo3o', 'name': '000000000034.jpg'}, {'id': '19a2YKODRRQI38gpxnAc', 'name': '000000000034.jpg'}, {'id': 'fkH8Fhm1CHyoNC1Vxx8C', 'name': '000000000034.jpg'}, {'id': 'X908XhEJEihYhLDC0cxt', 'name': '000000000154.jpg'}, {'id': 'HJUhhoMTo7Bqi9bq7AJi', 'name': '000000000154.jpg'}, {'id': 'IAyBhY7reAl53aRAVfBg', 'name': '000000000154.jpg'}, {'id': 'xcWpBZ7NY39Sgzq8x1uo', 'name': '000000000459.jpg'}, {'id': 'YgKYx9ryGtBDi3WLdyYA', 'name': '000000000459.jpg'}, {'id': 'dOBJwRhvv3HihjinVDQa', 'name': '000000000650.jpg'}, {'id': '8B6wIgwBxrYWkW4KPCSp', 'name': '000000000650.jpg'}]}

The API has returned a JSON response with 10 results.
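
The response also reports a "total" that is larger than the 10 results returned. If you need every match, one option is to request successive pages. The sketch below assumes that increasing "offset" skips that many results, which the response fields suggest but which this guide does not confirm. `fetch_page` is a hypothetical callable you would write to send the search request for a given offset and limit and return the parsed JSON.

```python
def iter_all_results(fetch_page, limit=10):
    """Yield every result by requesting pages until `total` is reached.

    `fetch_page(offset, limit)` is a hypothetical callable that sends the
    search request and returns the parsed JSON response as a dict.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        results = page.get("results", [])
        yield from results
        offset += len(results)
        # Stop when a page is empty or we have seen everything.
        if not results or offset >= page.get("total", 0):
            break
```

Keeping the HTTP call inside `fetch_page` means the paging logic can be exercised with a fake function before you point it at the real API.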

If you have the files in your dataset available locally, you can use the above JSON result to retrieve those files.
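
As a rough sketch of that lookup, the helper below matches each result name to local files that start with the same stem. It assumes local copies are named like "000000000034_jpg.rf.<hash>.jpg", the pattern that appears later in this guide; `find_local_files` is a hypothetical helper, so adjust the matching rule if your files are named differently.

```python
from pathlib import Path


def find_local_files(results, image_dir):
    """Map each unique result name to local files that share its stem."""
    matches = {}
    for result in results:
        name = result["name"]
        if name in matches:  # the same name can appear more than once
            continue
        stem = Path(name).stem  # "000000000034.jpg" -> "000000000034"
        matches[name] = sorted(
            p
            for p in Path(image_dir).iterdir()
            if p.is_file() and p.name.startswith(stem)
        )
    return matches
```

You could call it as `find_local_files(response["results"], HOME_DIR)`, where `HOME_DIR` is the folder that holds your images.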

If you would like a URL to the version of your image that we have stored, you can make a query to our /image API endpoint:

for result in response["results"]:
    image_url = requests.get(
        f"https://api.roboflow.com/{WORKSPACE_ID}/{DATASET_NAME}/image/{result['id']}?api_key=API_KEY",
    ).json()["image"]["urls"]["original"]
    print(image_url)

This will print the URL associated with each image.

Manual Semantic Search with CLIP

Step 1: Load the Data

Before we build a search engine, we need some data with which to work. For this project, we're going to use the images in the COCO 128 dataset from Roboflow Universe with which we worked earlier (Note: You can use any folder of images for this project). To download this dataset, first go to the dataset homepage on Roboflow Universe. If you don't already have a free Roboflow account, you will need to create one.

Then, click "Download this Dataset". When you click the button, make sure the "show download code" button is checked, then click "Continue". After a few moments, you will receive three download options: Jupyter, terminal, and raw URL. Click "Jupyter", then copy the code presented on the page into a file called "data.py":

!pip install roboflow

from roboflow import Roboflow
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("team-roboflow").project("coco-128")
dataset = project.version(2).download("coco")

The "!pip install roboflow" line is Jupyter notebook syntax, so remove it from the top of the snippet if you are saving the code to a Python file. If you don't already have the Roboflow pip package installed, run this command in your console:

pip install roboflow

Step 2: Embed Images

With the images for our search engine ready, we can begin work on our search engine!

Our search engine is going to follow these steps:

  1. Calculate image "embeddings" for all of the images in our folder using CLIP. Embeddings are a numerical representation of a piece of image or text data.
  2. Save embeddings, alongside the data they represent, to a faiss vector store for reference.
  3. Ask a user for a query.
  4. Calculate a text embedding for the user's query using CLIP.
  5. Use faiss to retrieve the images with embeddings closest to our text embedding. (The closer embeddings are, the more similar their contents are.)
  6. Return the names of the top 3 images.

Let's get to work!

First, let's import the requisite dependencies and initialize CLIP:

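import base64  # only needed if you use the Roboflow API variant below
import json
import os

import clip
import faiss
import numpy as np
import requests  # only needed if you use the Roboflow API variant below
import torch
from PIL import Image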

HOME_DIR = "/Users/james/Downloads/COCO 128.v2-640x640.coco/train/"

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

images = []

If a GPU is available with CUDA, CLIP will use the GPU. Otherwise, CLIP will use the CPU.

Next, we're going to write a function that will calculate embeddings for each image in the HOME_DIR directory above. We'll save all of these embeddings to a faiss store, a data store through which we can efficiently search embeddings. We'll also create a JSON file that stores a list of all of the images for which we calculate embeddings. The position of each item in the JSON list will match its position in the vector store. Positions start at 0, so if "apple.jpeg" is the 50th file indexed, it will sit at position 49 in both the JSON list and the vector store.

def retrieve_embeddings():
    # Reuse the saved index and file list if they already exist.
    if os.path.exists("index.bin"):
        index = faiss.read_index("index.bin")
        with open("references.json", "r") as f:
            data = json.load(f)
    else:
        # ViT-B/32 embeddings have 512 dimensions.
        index = faiss.IndexFlatL2(512)

        images = []

        for item in os.listdir(HOME_DIR):
            if item.lower().endswith((".jpg", ".jpeg", ".png")):
                image = (
                    preprocess(Image.open(os.path.join(HOME_DIR, item)))
                    .unsqueeze(0)
                    .to(device)
                )
                images.append((item, image))

        data = []

        for name, image in images:
            with torch.no_grad():
                image_features = model.encode_image(image)
                # Normalize each embedding to unit length.
                image_features /= image_features.norm(dim=-1, keepdim=True)

            # faiss expects float32; CLIP runs in half precision on a GPU.
            vector = image_features.cpu().numpy().astype("float32")

            data.append({"image": name, "features": vector.tolist()})
            index.add(vector)

        faiss.write_index(index, "index.bin")

        with open("references.json", "w") as f:
            json.dump(data, f)

    return index, data

This function will save a vector store and the JSON index to a local file. If the vector store exists, it is loaded from the local file. This means that you don't need to recompute the embeddings every time the program runs.

Alternatively, you can query the Roboflow Inference API (using our hosted infer.roboflow.com endpoint or our self-hosted inference API) to calculate CLIP embeddings for an image or a text query. If you query infer.roboflow.com, you can reduce the computational power required to calculate embeddings for your search engine. To calculate an image embedding, you can use this code:

embedding = requests.post(
    f"https://infer.roboflow.com/clip/embed_image?api_key=API_KEY",
    json={"image": [{"type": "base64", "value": base64.b64encode(open(os.path.join(HOME_DIR, image), "rb").read()).decode("utf-8")}]},
).json()["embeddings"][0]

If you want to use the Roboflow API to calculate embeddings, here is the full function you'll need:

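def retrieve_embeddings():
    # Reuse the saved index and file list if they already exist.
    if os.path.exists("index.bin"):
        index = faiss.read_index("index.bin")
        with open("references.json", "r") as f:
            data = json.load(f)
    else:
        index = faiss.IndexFlatL2(512)

        data = []

        for image in os.listdir(HOME_DIR):
            if image.lower().endswith((".jpg", ".jpeg", ".png")):
                with open(os.path.join(HOME_DIR, image), "rb") as f:
                    encoded_image = base64.b64encode(f.read()).decode("utf-8")

                embedding = requests.post(
                    "https://infer.roboflow.com/clip/embed_image?api_key=API_KEY",
                    json={"image": [{"type": "base64", "value": encoded_image}]},
                ).json()["embeddings"][0]

                # faiss expects a 2D float32 array with one row per vector.
                vector = np.array([embedding], dtype="float32")
                # Normalize to unit length, as in the local CLIP version.
                vector /= np.linalg.norm(vector, axis=-1, keepdims=True)

                data.append({"image": image, "features": vector.tolist()})
                index.add(vector)

        faiss.write_index(index, "index.bin")

        with open("references.json", "w") as f:
            json.dump(data, f)

    return index, data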

Then, we need to build (or load) the index and get a search query. Let's use the Python input() function to retrieve a query for the purposes of this guide, then tokenize that query so CLIP can calculate a text embedding for it:

index, data = retrieve_embeddings()

query = input("Enter a search query: ")

tokenized_query = clip.tokenize([query]).to(device)

You can also retrieve the text embedding for the query using the Roboflow API:

tokenized_query = requests.post(
    f"https://infer.roboflow.com/clip/embed_text?api_key=API_KEY",
    json={"text": query}
).json()["embeddings"][0]

Next, we can use faiss to find the 3 images that are most similar to our query. Note that "most similar" does not necessarily mean that these images will be related to our query.

with torch.no_grad():
    text_features = model.encode_text(tokenized_query)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # faiss expects float32; CLIP runs in half precision on a GPU.
    D, I = index.search(text_features.cpu().numpy().astype("float32"), k=3)

    for i in I[0]:
        print(data[i]["image"])
        # open image
        # image = Image.open(os.path.join(HOME_DIR, data[i]["image"]))
        # image.show()

We query the vector store using the index.search() method. We pass the k=3 argument to state that we only want to retrieve the three most relevant results. Then, we find the name of each of the three most relevant results using the "data" JSON list we created earlier. We print each file name to the console.
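
To show what a flat L2 index does during a search, here is a small NumPy sketch of the same idea: compute the squared L2 distance from the query to every stored vector, then keep the k smallest. The vectors and file names are toy values, and `flat_l2_search` is a hypothetical helper rather than part of faiss. For unit-length vectors, the squared distance equals 2 minus twice the dot product, so ranking by L2 distance gives the same order as ranking by cosine similarity.

```python
import numpy as np


def flat_l2_search(vectors, query, k=3):
    """Return (squared distances, positions) of the k nearest vectors."""
    vectors = np.asarray(vectors, dtype="float32")
    query = np.asarray(query, dtype="float32")
    # Squared L2 distance from the query to every stored vector.
    distances = ((vectors - query) ** 2).sum(axis=1)
    nearest = np.argsort(distances)[:k]
    return distances[nearest], nearest


# Hypothetical two-dimensional vectors standing in for image embeddings.
file_names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]

distances, positions = flat_l2_search(vectors, [1.0, 0.0], k=3)

# Positions map back to file names, just like the "data" list above.
print([file_names[i] for i in positions])
```

The positions returned here play the same role as `I[0]` in the faiss code: each one is a row number that you look up in a parallel list of file names.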

If we search for a term like "dinosaur", the images returned will not be related to the query because our underlying dataset does not contain any dinosaurs.
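
One way to avoid surfacing unrelated images is to discard hits that are too far from the query. The sketch below filters one row of search output by squared L2 distance. `filter_by_distance` is a hypothetical helper, and the threshold shown is an arbitrary value for illustration; a useful cut-off depends on your data and would need to be tuned by inspecting real results.

```python
def filter_by_distance(distances, positions, file_names, max_distance):
    """Keep only hits whose squared L2 distance is at most max_distance."""
    return [
        (file_names[i], float(d))
        for d, i in zip(distances, positions)
        # faiss uses -1 for empty slots when fewer than k vectors exist.
        if i != -1 and d <= max_distance
    ]


# Toy values shaped like one row of faiss output (D[0] and I[0]).
names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
hits = filter_by_distance([0.9, 1.3, 1.8], [4, 0, 2], names, max_distance=1.5)
print(hits)
```

With the search code above, a call might look like `filter_by_distance(D[0], I[0], [d["image"] for d in data], max_distance=1.5)`, where 1.5 is again only a placeholder.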

If you are using the Roboflow API to calculate embeddings, you can retrieve related images using this code:

# faiss expects a 2D float32 array with one row per query.
D, I = index.search(np.array([tokenized_query], dtype="float32"), k=3)

for i in I[0]:
    print(data[i]["image"])
    # open image
    # image = Image.open(os.path.join(HOME_DIR, data[i]["image"]))
    # image.show()

We're now ready to test our search engine!

Step 3: Test the Search Engine

To test the search engine, save the code from the last step to a file called app.py and run it. You will be asked to enter a query. Type in a query for which there will be related matches in your dataset, then press enter. In our example, let's type in the query "zebra", a class we know is in our dataset. Here are the results:

/Users/james/Downloads/COCO 128.v2-640x640.coco/train/000000000034_jpg.rf.a33e87d94b16c1112e8d9946fee784b9.jpg
/Users/james/Downloads/COCO 128.v2-640x640.coco/train/000000000154_jpg.rf.300698916140dd41f6fda1c194d7b00d.jpg
/Users/james/Downloads/COCO 128.v2-640x640.coco/train/000000000597_jpg.rf.2c8f04559f193762dc844986f6d60cad.jpg

We have three images to review.

The images are presented in order of relevance. Two images of zebras were found. One image of an outdoor environment was returned in which no zebras were present. This is because our code has been instructed to return the three most similar results; the third photo is closer to the query than the rest of the images in our dataset. If we had more images of zebras in our dataset, they would likely be surfaced above the third photo.

If you want to test the search engine on a different folder of images, you will need to delete or move the index.bin and references.json files. By default, the script will use these files if they are present instead of computing embeddings again.
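
If you would rather detect a stale index than remember to delete it, a small check like the one below may help. It is a sketch that assumes references.json holds a list of entries with an "image" key containing the file name, as in the function written earlier; `index_is_stale` is a hypothetical helper.

```python
import json
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def index_is_stale(image_dir, references_path="references.json"):
    """Return True if the saved file list no longer matches image_dir."""
    if not os.path.exists(references_path):
        return True
    with open(references_path, "r") as f:
        indexed = {entry["image"] for entry in json.load(f)}
    on_disk = {
        name
        for name in os.listdir(image_dir)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    }
    return indexed != on_disk
```

If the check returns True, you could remove index.bin and references.json before calling retrieve_embeddings() so that the embeddings are computed again.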

Evaluating Running a Self-Hosted CLIP Search Engine

In this guide, we have walked through two methods you can use to build a CLIP search engine: run your own, computing CLIP vectors on your own machine (or via the Roboflow API) and storing them in a local index, or use the CLIP embeddings Roboflow computes for your images as an out-of-the-box solution.

If you calculate CLIP vectors on your own machine and store them in a vector store, there are several considerations to keep in mind. First, indexes grow as your dataset grows, so you will need to monitor storage requirements. Second, after you have the core CLIP code written, you will still need to wrap the code in an API and/or a web interface for use in your application. Third, you will need to write more logic to associate images with other metadata.
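
To get a feel for the storage point, here is a back-of-the-envelope estimate for a flat index, which stores every vector uncompressed. With 512 float32 values per image, each vector takes 2,048 bytes. The function name is hypothetical, and the estimate ignores file headers and the separate references.json file.

```python
def flat_index_bytes(num_images, dimensions=512, bytes_per_value=4):
    """Approximate size of the raw vectors in a flat float32 index."""
    return num_images * dimensions * bytes_per_value


for n in (128, 100_000, 10_000_000):
    print(f"{n:>10,} images -> {flat_index_bytes(n) / 1e6:,.1f} MB")
```

Note that the references.json file in this guide also stores each embedding as text, which takes more space per value than the binary index does.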

Using Roboflow's search API lets you build a CLIP semantic search feature on your dataset that works out-of-the-box and will scale with your needs. As you add more images, we'll compute CLIP vectors for you and provide a robust endpoint through which you can query your index.

Conclusion

In this guide, we have used CLIP to implement a semantic search engine on a folder of images. We use CLIP to calculate image embeddings for all of the images in a dataset we downloaded from Roboflow Universe. We save these embeddings, alongside a list that we can use to map each embedding to an image file name, to our local machine. These embeddings are calculated when the script is first run; on subsequent runs, the embedding index is opened from the local machine and queried.

We ask a user for a query, calculate a text embedding associated with the query, and then use our vector index to find the images most related to the user query. We print the names of the top three images to the console. To extend this project, you could build a web interface that displays the results visually. This will both make debugging easier and provide a more intuitive interface through which you can interact with your search engine.
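
As a first step toward a visual interface, the sketch below writes the ranked results to a static HTML page that you can open in a browser. It uses only the Python standard library; `render_results_page` and the default "results.html" file name are hypothetical choices, not part of the code in this guide.

```python
import html
from pathlib import Path


def render_results_page(query, image_paths, output_path="results.html"):
    """Write a simple HTML page showing each result image in ranked order."""
    items = "\n".join(
        # resolve() makes the path absolute, which as_uri() requires.
        f'<li><img src="{html.escape(Path(p).resolve().as_uri())}" width="320">'
        f"<br>{html.escape(Path(p).name)}</li>"
        for p in image_paths
    )
    page = (
        "<!doctype html>\n"
        f"<title>Results for {html.escape(query)}</title>\n"
        f"<h1>Results for {html.escape(query)}</h1>\n"
        f"<ol>\n{items}\n</ol>\n"
    )
    Path(output_path).write_text(page, encoding="utf-8")
    return output_path
```

With the search code from this guide, you might pass in `[os.path.join(HOME_DIR, data[i]["image"]) for i in I[0]]` as the list of image paths.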

Cite this Post

Use the following entry to cite this post in your research:

James Gallagher. (Mar 30, 2023). How to Build a Semantic Image Search Engine with Roboflow and CLIP. Roboflow Blog: https://blog.roboflow.com/clip-semantic-search/

Written by

James Gallagher
James is a technical writer at Roboflow, with experience writing documentation on how to train and use state-of-the-art computer vision models.