Modern computer vision models can power rich image search experiences in which you query a database of images by their semantic content rather than by manually applied tags.

For example, you could search a database of landmark images with a query like “San Francisco” to find images that relate to San Francisco. You can also search for images that are related to another image, which is useful when you need to search a database for the closest image to a given image.

The use cases for image search in enterprise are vast. You can build media search engines for catalogs of content with millions of images without having to manually tag each image. You can build advanced systems to find whether an image is closely related to another in your dataset.

In this guide, we are going to walk through how to build an image search engine. We will use CLIP, a model that lets you compare images and text by their semantic content, and we will run it on Intel’s Gaudi2 system, for which the CLIP implementation we use has been optimized.

Gaudi2 is designed for scale, making it a strong fit for enterprise applications that work with millions of images. One Gaudi2 chip has 24 Tensor Processor Cores, 96 GB of HBM2E memory, dual matrix multiplication engines, and more.

This tutorial assumes you have a Gaudi2 system set up. To learn more about setting up a Gaudi2 system, refer to the Intel Cloud documentation.

Without further ado, let’s get started!

What is CLIP?

Contrastive Language-Image Pre-training (CLIP) is an open source computer vision model developed by OpenAI. With CLIP, you can calculate embeddings for both images and text. Image embeddings encode the semantic content of an image; text embeddings do the same for a text prompt.

You can compare image embeddings to find the images most similar to a given image. You can also compare text embeddings to image embeddings to find the images most similar to a given text query (e.g. find images related to the query “San Francisco”).

There is no “master list” of prompts. CLIP is open vocabulary, which means you can provide any text prompt you want when searching a database of image embeddings.
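As a rough conceptual sketch of what “comparing embeddings” means (assuming you already have two embeddings as NumPy arrays; the vectors below are random stand-ins), similarity is commonly measured with cosine similarity:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    a, b = a.flatten(), b.flatten()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors; in practice these would be CLIP image and text embeddings.
image_embedding = np.random.rand(512).astype("float32")
text_embedding = np.random.rand(512).astype("float32")

print(cosine_similarity(image_embedding, text_embedding))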

To build a search engine with CLIP, we need to:

  1. Install and set up CLIP.
  2. Calculate CLIP embeddings (also called “vectors”) for each image in a dataset.
  3. Save the embeddings in a vector database. Vector databases are efficient ways to search embeddings.
  4. Write logic that searches our database.

Let’s walk through each of these steps.

Step #1: Install CLIP and Faiss

We are going to use a version of CLIP implemented in Hugging Face Transformers, an open source Python package used to run machine learning models. The Transformers implementation of CLIP has been optimized for Gaudi2 through the Hugging Face Optimum program.

To install Transformers, run the following command on your system:

pip install transformers

When we calculate CLIP vectors, we need to save them somewhere they can be efficiently searched. We are going to save these vectors in a vector database. For this guide, we will use Faiss, a similarity search library that scales to millions of vectors. Faiss will run on our server. With that said, in production you may choose to run a vector database on a separate server, depending on your project architecture.

To install Faiss, run the following command:

pip install faiss-cpu

In addition, we need to install the following dependencies:

pip install Pillow tqdm

Now that we have the required dependencies installed, we can start writing our application logic.
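Before doing so, you can optionally run a quick sanity check to confirm Faiss and NumPy are installed correctly. The sketch below builds a tiny index from random vectors and searches it; the numbers are arbitrary and only used for the check:

import faiss
import numpy as np

# Build a small inner-product index over 100 random 512-dimensional vectors.
dimension = 512
vectors = np.random.rand(100, dimension).astype("float32")

index = faiss.IndexFlatIP(dimension)
index.add(vectors)

# Look up the five nearest neighbours of the first vector.
scores, indices = index.search(vectors[:1], 5)
print(indices)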

Step #2: Prepare the Dataset

For this guide, we are going to calculate CLIP vectors for a dataset of world landmarks. We will use the train split of a landmark image dataset, which contains 7053 images, to build our image search engine.

We can download our dataset from Roboflow using the following code:

!pip install roboflow

from roboflow import Roboflow
rf = Roboflow(api_key="API_KEY")
project = rf.workspace("kg-dqu3z").project("landmark-s1jsp")
dataset = project.version(1).download("clip")

Above, replace API_KEY with your Roboflow API key. You can also use any folder of images you have for use in this tutorial.

The dataset will be downloaded into a folder called Europe-Landmarks-Classification-1. The images are nested in sub-folders by class, but we can move every image into a single folder for calculating CLIP vectors using the following bash script:

#!/bin/bash

cd Europe-Landmarks-Classification-1/
# Define the base directory where the folders are located
base_dir="train"

# Loop through each folder in the base directory
for folder in "$base_dir"/*; do
    if [ -d "$folder" ]; then  # Check if it is a directory
        # Move all files from the subfolder to the base directory
        mv "$folder"/* "$base_dir/"
    fi
done

This script will move every image from a sub-folder in the train directory to the root of the train directory for use in building our system.
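If you prefer to stay in Python rather than bash, the sketch below does roughly the same thing. It assumes the same folder layout and should be run from the directory that contains the dataset folder:

import shutil
from pathlib import Path

base_dir = Path("Europe-Landmarks-Classification-1/train")

# Move every file out of each class sub-folder into train/ itself.
for subfolder in [path for path in base_dir.iterdir() if path.is_dir()]:
    for file in subfolder.iterdir():
        shutil.move(str(file), str(base_dir / file.name))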

The Gaudi2 chip allows for fast computation of CLIP vectors. In testing, it took 2m27s to calculate vectors for all 7053 images, or roughly 48 images per second.

Step #3: Create a Vector Database

We need a vector database in which we can store our embeddings. We will use this vector database to search our embeddings. Create a new Python file and add the following code:

import os

import faiss

try:
    import habana_frameworks.torch.core as htcore

    DEVICE = "hpu"
except ImportError:
    DEVICE = "cpu"

import numpy as np
import tqdm
from PIL import Image

import torch
import json

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(DEVICE)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_image_embedding(image):
    inputs = processor(images=[image], return_tensors="pt", padding=True).to(DEVICE)

    outputs = model.get_image_features(**inputs)

    return outputs.cpu().detach().numpy()

def get_text_embedding(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True).to(DEVICE)

    outputs = model.get_text_features(**inputs)

    return outputs.cpu().detach().numpy()


DATASET_PATH = "Europe-Landmarks-Classification-2"
SAVE_INTERVAL = 100

index = faiss.IndexFlatIP(512)
file_names = []
TRAIN_IMAGES = os.path.join(DATASET_PATH, "train")

for i, file in tqdm.tqdm(enumerate(os.listdir(TRAIN_IMAGES))):
    try:
        frame = Image.open(os.path.join(TRAIN_IMAGES, file))
    except IOError:
        print("error computing embedding for", file)
        continue

    embedding = get_image_embedding(frame)

    index.add(embedding)

    file_names.append(file)

    if i % SAVE_INTERVAL == 0:
        faiss.write_index(index, "index.bin")

faiss.write_index(index, "index.bin")

with open("index.json", "w") as f:
    json.dump(file_names, f)

In this code, we calculate a CLIP vector for every image in our dataset and add it to a Faiss index. Every 100 vectors (the SAVE_INTERVAL in the code), we write the full index to disk as “index.bin” so that progress is not lost, and we write it one final time once every image has been processed.

We also save a file called “index.json”. This file stores the mapping between each position in our Faiss index and the corresponding image file name, so we can turn search results back into image files.

As this script runs, a progress bar will appear which shows how many images have been processed.
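One detail worth noting: IndexFlatIP ranks results by raw inner product. If you would rather rank by cosine similarity, one optional tweak (not something the code above does) is to L2-normalize every embedding before adding it to the index and before searching, for example:

import faiss
import numpy as np

# Stand-in for a CLIP embedding with shape (1, 512).
embedding = np.random.rand(1, 512).astype("float32")

# normalize_L2 works in place on a float32 array; once vectors are normalized,
# inner product and cosine similarity produce the same ranking.
faiss.normalize_L2(embedding)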

Step #4: Search the Image Vector Database

Once you have calculated image embeddings for all the images in your dataset, you can build logic to search the image vector database.

When a user performs a search, we will need to calculate a CLIP vector for their search term. This process will be almost instantaneous as a result of the speed of the Gaudi2 chip.

You can search the database in two ways:

  1. Using a text query, or;
  2. Using an image query.

Let’s try both methods. First, let’s search our database to find landmarks related to the text prompt “palace”.

Create a new Python file and add the following code:

import os

import faiss
try:
    import habana_frameworks.torch.core as htcore

    DEVICE = "hpu"
except ImportError:
    DEVICE = "cpu"

import numpy as np
import tqdm
from PIL import Image

import torch
import json

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(DEVICE)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_image_embedding(image):
    inputs = processor(images=[image], return_tensors="pt", padding=True).to(DEVICE)

    outputs = model.get_image_features(**inputs)

    return outputs.cpu().detach().numpy()

def get_text_embedding(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True).to(DEVICE)

    outputs = model.get_text_features(**inputs)

    return outputs.cpu().detach().numpy()

DATASET_PATH = "Europe-Landmarks-Classification-1"
TRAIN_IMAGES = os.path.join(DATASET_PATH, "train")

# Load the Faiss index and file name mapping saved in Step #3
index = faiss.read_index("index.bin")

with open("index.json", "r") as f:
    file_names = json.load(f)

def get_similar_images(prompt, k=5):
    if isinstance(prompt, str):
        embedding = get_text_embedding(prompt)
    else:
        embedding = get_image_embedding(prompt)

    _, indices = index.search(embedding, k)

    return [file_names[i] for i in indices[0]]

print("palace")
print(get_similar_images("palace"))

In this code, we load our Faiss index and calculate a text embedding for our prompt, “palace”. Then, we search the vector database to find the images whose embeddings are closest to our text embedding. We use the “index.json” file we created in the last step to map the results from the vector database back to image file names.

This script prints the file names of the images most similar to our text prompt. Here are the top results, when opened:

We were able to successfully search our dataset of images. All results above are palaces. The first two images are Wilanow Palace in Poland. The third image is the Palace on the Isle, also in Poland.

With these image search capabilities, you can build rich, semantic search experiences. These experiences could be internally-facing, for example for use in building a large, private media index. Or you could use image search as an external-facing experience, such as in the case of an app that lets someone search for destinations by phrase (i.e. palace) and have places recommended to them.

Let's search for “cathedral”. Here are the top two results:

Despite the different artistic styles in the images, the search engine was able to successfully identify two images of cathedrals. In both examples above, the images show the Sagrada Familia in Barcelona, Spain.

You can also search the database to find images most related to another image using the following code:

print(get_similar_images(Image.open(os.path.join(TRAIN_IMAGES, "arc_de_triomphe_15_jpg.rf.a8e82dff1fd5ec6eacae06483c02bedc.jpg"))))

For example, we can provide this photo of the Arc de Triomphe in Paris, France:

Our script says the following images are closest to the provided photo:

Our search engine successfully identified similar images. Because the dataset contains duplicates, the closest matches are copies of the query image itself; if there were no duplicates, we would instead see other buildings in the style of the Arc de Triomphe, since the search engine returns the images most similar to a given image. The example above shows both the potential of our search engine and the importance of deduplicating your dataset.

To learn more about deduplicating a dataset with Gaudi2, refer to our Build Enterprise Datasets with CLIP for Multimodal Model Training Using Intel Gaudi2 HPUs guide.
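As a rough illustration of the idea (a naive check, not the method from that guide), you can flag likely duplicates by searching the index with each stored vector and comparing the best non-self match to the vector's similarity with itself. The 0.99 threshold below is an arbitrary assumption:

import json

import faiss
import numpy as np

index = faiss.read_index("index.bin")

with open("index.json", "r") as f:
    file_names = json.load(f)

# Pull the stored vectors back out of the flat index.
vectors = index.reconstruct_n(0, index.ntotal)

# For each vector, retrieve its two closest matches (usually itself plus one other).
scores, indices = index.search(vectors, 2)

for i in range(index.ntotal):
    self_score = float(np.dot(vectors[i], vectors[i]))
    for score, j in zip(scores[i], indices[i]):
        if j != i:
            if score > 0.99 * self_score:
                print("possible duplicate:", file_names[i], "and", file_names[j])
            break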

Integrating Image Search with Multimodal Language Models

Embedding-based image search is the retrieval component of Retrieval Augmented Generation (RAG). In RAG, you retrieve information relevant to a prompt and pass it, along with the prompt, to a Large Multimodal Model (LMM). You can use the code above as the retrieval component in a RAG-based system for images.

For example, consider a scenario where you are building an experience that allows travelers to explore their ideal destination.

You could have a traveler type in a description (e.g. “skyscrapers”, “Gothic architecture”, or “prairie landscapes”). Then, you could use the image search system built above to retrieve images related to their search. These images could be fed into an LMM that enriches the results of the search query, providing people with detailed advice on finding a holiday destination that suits them.
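A minimal sketch of that flow, assuming the get_similar_images() function from Step #4 is in scope; send_to_lmm() is a hypothetical placeholder that you would replace with a call to whichever multimodal model you use:

def send_to_lmm(query, image_paths):
    # Hypothetical stub: a real implementation would pass the query and the
    # retrieved images to a Large Multimodal Model and return its response.
    return f"Retrieved {len(image_paths)} candidate destinations for '{query}'"

query = "Gothic architecture"
retrieved_images = get_similar_images(query, k=5)

print(send_to_lmm(query, retrieved_images))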

In addition, you could use the code in this guide to build a retrieval system for a media archive. You could divide a video into frames and apply the logic outlined in this guide to each frame. Internal stakeholders could use this search engine to check for the presence of specific objects in a scene (e.g. coffee cups accidentally left in shot during production) or to find an episode that meets a specific description for use in planning a seasonal programming schedule.
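For the video case, one way to generate frames is with OpenCV (an extra dependency, installable with pip install opencv-python). The sketch below samples roughly one frame per second from a hypothetical episode.mp4 and reuses the get_image_embedding() function from earlier; how you store the resulting embeddings is up to you:

import cv2
from PIL import Image

video = cv2.VideoCapture("episode.mp4")  # hypothetical file name
fps = int(video.get(cv2.CAP_PROP_FPS)) or 1

frame_embeddings = []
frame_number = 0

while True:
    ret, frame = video.read()
    if not ret:
        break
    if frame_number % fps == 0:
        # OpenCV returns BGR arrays; convert to an RGB PIL image for CLIP.
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frame_embeddings.append(get_image_embedding(image))
    frame_number += 1

video.release()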

Conclusion

The Intel Gaudi2 system can be used to run CLIP at scale. In this guide, we walked through how to build an image search engine with CLIP on Gaudi2 that supports both text and image queries. We used a Faiss index as an example method by which to store CLIP embeddings, but you can use any vector database in your application.

The Intel Gaudi2 hardware is useful in applications where speed is critical and you expect heavy load from clients. In this guide, we calculated 7053 CLIP vectors in 2m27s. We then used the Gaudi2 system to calculate a text embedding for a search query, searched our database, and returned the images most related to our text prompt.

To learn more about the Gaudi2 system, refer to the Gaudi2 Habana AI product specification.