How to Build an Image-to-Image Search Engine with CLIP and Faiss

Suppose you have a folder of photos you have taken and you want to find all images that match a particular scene. You could have a text-based search engine that, given a text query, returns related results.

With that said, a picture is worth a thousand words. Using a picture as the query is often a faster way to describe precisely what you are looking for, in a form a computer can use to locate similar results: a reference image encodes more semantics and detail than most text queries we would write.

This type of search is called “image-to-image” search. Given a query image and a database of images, you can rank every image in the database by its similarity to the query.

In this guide, we are going to show you how to build an image-to-image search engine using CLIP, an open-source vision-language model from OpenAI that maps images and text into a shared embedding space, and faiss, an open-source vector similarity search library from Meta AI that you can run locally. By the end of this guide, we’ll have a search engine written in Python that returns images related to a provided image.

The steps we will follow are:

  1. Install the required dependencies
  2. Import dependencies and download a dataset
  3. Calculate CLIP vectors for images in our dataset
  4. Create a vector database that stores our CLIP vectors
  5. Search the database

Without further ado, let’s get started!

💡
This tutorial comes with an accompanying Google Colab that you can use to follow along and make your own search engine.

How to Build an Image-to-Image Search Engine

The search engine we will build in this article will return results semantically related to an image. What does this mean? If you upload a photo of a scene in a particular environment, you can retrieve results with similar attributes to that scene. If you upload a photo of a particular object, you can find images with similar objects.

We’ll build a search engine using COCO 128, a dataset with a wide range of different objects, to illustrate how CLIP makes it easy to search images using other images as an input.

With this approach, you can search for:

  1. Exact duplicates of an image;
  2. Near duplicates of an image; and
  3. Images that appear in a specific scene or share attributes with the provided image, among other use cases.

The first two capabilities are useful for checking whether a dataset already contains images identical or similar to a given image, and how many. The third capability lets you search a dataset by the attributes present in an image.

Our search engine will be powered by “vectors”, or “embeddings”. Embeddings are “semantic” representations of an image, text, or other data. Embeddings are calculated by a machine learning model that has been trained on a wide range of data.

Embeddings are “semantic” because they encode different features in an image, an attribute which enables comparing two embeddings to find the similarity of images. Similarity comparison is the backbone of image search, the application we are focusing on in this article.

For our search engine, we will use CLIP embeddings. CLIP was trained on 400 million image-text pairs, and it performs well for a range of image search use cases.
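
To make similarity comparison concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity using numpy. The vectors below are random stand-ins; the real CLIP embeddings we calculate later are 512-dimensional, which is why we size our faiss index at 512 further down.

import numpy as np

# Random stand-ins for two image embeddings; real CLIP vectors are 512-dimensional
embedding_a = np.random.rand(512)
embedding_b = np.random.rand(512)

# Cosine similarity: values close to 1 mean the embeddings point in the same direction
similarity = np.dot(embedding_a, embedding_b) / (
    np.linalg.norm(embedding_a) * np.linalg.norm(embedding_b)
)
print(similarity)

Our search engine will use L2 (Euclidean) distance through faiss rather than computing similarities by hand, but the principle is the same: vectors that are close together represent images with similar content.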

Now that we have discussed how our search engine will work, let’s start building the system!

You can use any folder of images for your search engine. For this guide, we’ll use the COCO 128 dataset on Roboflow Universe. COCO 128 contains a range of different objects, which makes it a good dataset for illustrating the search tool we are going to build.

Step #1: Install Dependencies

First, we need to calculate CLIP vectors for all the images we want to include in our dataset. To do so, we can use Roboflow Inference. Inference is an open-source, production-ready system you can use for deploying computer vision models, including CLIP.

To install Inference on your machine, refer to the official Inference installation instructions. Inference supports installation via pip and Docker.

We are going to use the Docker installation method in this guide, which enables you to set up a central server for use in calculating CLIP embeddings. This is an ideal deployment option if you need to calculate a large number of vectors.

For instance, the following commands pull and start the GPU version of the Inference server on a CUDA-enabled device (the exact docker run flags can vary by environment, so check the Inference installation instructions for your device):

docker pull roboflow/roboflow-inference-server-gpu
docker run -p 9001:9001 --gpus=all roboflow/roboflow-inference-server-gpu

Inference will run at http://localhost:9001 when installed with Docker.
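
To confirm the server is reachable before continuing, you can send a quick request to that address from Python. This is just a connectivity check; the exact response body depends on your Inference version.

import requests

# Any HTTP response here means the Inference server is up and listening
response = requests.get("http://localhost:9001")
print(response.status_code)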

There are a few more dependencies we need to install using pip (the roboflow package is used in the next step to download our dataset):

pip install faiss-gpu supervision roboflow -q

Replace faiss-gpu with faiss-cpu if you are running on a device without a CUDA-enabled GPU.

With the required dependencies installed, we can start writing our search engine.

Step #2: Import Dependencies

Create a new Python file and paste in the following code:

import base64
import os
from io import BytesIO
import cv2
import faiss
import numpy as np
import requests
from PIL import Image
import json
import supervision as sv

This code will load all of the dependencies we will use.

To download a dataset from your Roboflow account or Roboflow Universe account, create a new Python script and add the following code:

import roboflow

roboflow.login()

roboflow.download_dataset(dataset_url="https://universe.roboflow.com/team-roboflow/coco-128/dataset/2", model_format="coco")

The URL should be the URL to a specific dataset version on Roboflow or Roboflow Universe.

Here is an example image in the dataset:

When you run this code, you will first be asked to authenticate if you have not already signed in to Roboflow via the command line. You only need to run this code once to download your dataset, so it does not need to be part of your main script.
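
If you prefer not to hard-code the download location, the download call returns a dataset object whose location attribute points at the folder that was created. This reflects how recent versions of the roboflow package behave, so treat it as an assumption and check it against your installed version.

import roboflow

roboflow.login()

# Capture the download location instead of hard-coding it
dataset = roboflow.download_dataset(
    dataset_url="https://universe.roboflow.com/team-roboflow/coco-128/dataset/2",
    model_format="coco",
)
print(dataset.location)  # the local folder (e.g. COCO-128-2) to use as DATASET_PATH later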

Step #3: Calculate CLIP Vectors for Images

Next, add the following code to the file in which you imported all the project dependencies:

INFERENCE_ENDPOINT = "http://localhost:9001"
API_KEY = ""  # your Roboflow API key

def get_image_embedding(image: Image.Image) -> list:
    # Encode the image as base64 so it can be sent to the Inference server
    image = image.convert("RGB")

    buffer = BytesIO()
    image.save(buffer, format="JPEG")
    image = base64.b64encode(buffer.getvalue()).decode("utf-8")

    payload = {
        "image": {"type": "base64", "value": image},
    }

    data = requests.post(
        INFERENCE_ENDPOINT + "/clip/embed_image?api_key=" + API_KEY, json=payload
    )

    response = data.json()
    embedding = response["embeddings"]
    return embedding

In this code, we define a new function that calculates an embedding for an image. The function takes a PIL image, encodes it as base64, sends it to Inference to retrieve an embedding, and returns that embedding. Before running it, set API_KEY to your Roboflow API key.
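
As a quick sanity check, you can embed a single image and inspect the result. The file name below is a placeholder; point it at any image on your machine, and make sure the Inference server is running and API_KEY is set.

# "example.jpg" is a placeholder; substitute any local image
example_image = Image.open("example.jpg")
embedding = get_image_embedding(example_image)

# The server returns a list containing one 512-dimensional vector
print(len(embedding), len(embedding[0]))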

Step #4: Create a Vector Database

Now that we can calculate embeddings, we need to create a vector database in which to store them. Vector databases can efficiently retrieve similar vectors, which is essential for our search engine.
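
To see the add-then-search pattern we are about to use, here is a minimal, standalone sketch of a faiss flat index over a few random vectors:

import faiss
import numpy as np

# A flat L2 index over 512-dimensional vectors; a search compares the query against every stored vector
demo_index = faiss.IndexFlatL2(512)
demo_index.add(np.random.rand(10, 512).astype(np.float32))

# Retrieve the three stored vectors closest to a random query
distances, positions = demo_index.search(np.random.rand(1, 512).astype(np.float32), 3)
print(positions[0], distances[0])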

Add the following code to the Python script in which we have been working:

DATASET_PATH = "COCO-128-2"  # the folder into which the dataset was downloaded in Step #2
TRAIN_IMAGES = os.path.join(DATASET_PATH, "train")

index = faiss.IndexFlatL2(512)
file_names = []

for frame_name in os.listdir(TRAIN_IMAGES):
    try:
        frame = Image.open(os.path.join(TRAIN_IMAGES, frame_name))
    except IOError:
        print("error loading image", frame_name)
        continue

    embedding = get_image_embedding(frame)

    index.add(np.array(embedding).astype(np.float32))

    file_names.append(frame_name)

faiss.write_index(index, "index.bin")

with open("index.json", "w") as f:
    json.dump(file_names, f)

In this code, we loop over every image in the training folder, calculate a CLIP embedding for each one, and add the embeddings to a faiss index. We also keep a list of file names in insertion order, which we need in order to map vectors in the index back to the images they represent.

We then save the index to a file called “index.bin” and the file name list to “index.json”. Together, these let us map an insertion position, which is what the index returns, back to a filename, and they let us re-use the index the next time we run the program.
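
The next time you run the program, you can reload both files instead of recalculating every embedding:

# Reload the saved index and the file name mapping from disk
index = faiss.read_index("index.bin")

with open("index.json", "r") as f:
    file_names = json.load(f)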

Step #5: Search the Database

Now for the fun part: to run a search query!

Add the following code to the Python file in which you have been working:

FILE_NAME = ""
RESULTS_NUM = 3

# Embed the query image, then retrieve the closest stored vectors
query = get_image_embedding(Image.open(FILE_NAME))
D, I = index.search(np.array(query).astype(np.float32), RESULTS_NUM)

# Map the returned index positions back to file names and load the images
images = [cv2.imread(os.path.join(TRAIN_IMAGES, file_names[i])) for i in I[0]]

sv.plot_images_grid(images, (3, 3))

In the code above, replace FILE_NAME with the path to the image that you want to use in your search. DATASET_PATH, which we set in the previous step, should point to the folder into which the dataset was downloaded (i.e. COCO-128-2/); the embeddings we calculated came from its train subfolder. This code returns three results by default, but you can return more or fewer by changing the value of RESULTS_NUM.

This code will calculate an embedding for a provided image, which is then used as a search query with our vector database. We then plot the top three most similar images.
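
If you want to inspect the matches rather than only plot them, you can print each returned file name alongside its L2 distance (lower means more similar):

# I holds the index positions of the nearest neighbors, D the corresponding L2 distances
for distance, position in zip(D[0], I[0]):
    print(file_names[position], distance)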

Consider this image:

When we used this image as a query to our search engine, the following results were returned:

Above, three images related to our query were returned. The first image is the image we used as a search query, which tells us the image we used as a query is in our dataset. If the image wasn't in our dataset, we would see another image.

This property shows the similarity capabilities: images that are close to, or the same as, another image should have the highest similarity. Then, images that are semantically similar (in this case, other images of food) will appear.

If there are no images similar to your query, results will still be returned, because the search always retrieves the three closest images to the query. This is still useful: if, after visual inspection, no relevant results appear, we can assume there are no closely related images in our dataset.
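
If you want to reject poor matches automatically, one option is to filter results by distance. The threshold below is a hypothetical value; you would tune it for your own data by inspecting the distances returned for known good and bad matches.

# DISTANCE_THRESHOLD is a hypothetical cut-off; tune it on your own dataset
DISTANCE_THRESHOLD = 1.0

filtered = [
    file_names[position]
    for distance, position in zip(D[0], I[0])
    if distance < DISTANCE_THRESHOLD
]
print(filtered)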

Conclusion

In this guide, we built an image-to-image search engine with CLIP. This search engine can take an image as an input and return semantically similar images. We used CLIP to calculate embeddings for our search engine, and faiss to store them and run searches.

This search engine could be used to find duplicate or similar images in a dataset. The former use case is useful for auditing a dataset. The latter use case could be presented as a search engine for a media archive, among many other use cases.