How to Build a Photo Memories App with CLIP

Many popular photos applications have features that collate photos into slideshows, sometimes referred to as “memories.” These slideshows are centered around a theme such as a particular location, person, or a concept common across your photos.

Using CLIP, an image model developed by OpenAI, we can build a photo memories app that groups photos according to a specified theme. We can then collate the images retrieved by CLIP into a video that you can share with friends and family.

Here is the memories slideshow we make during the tutorial:

[Video: the finished memories slideshow]

Without further ado, let’s get started!

How to Build a Photo Memories App with CLIP

To build our photo memories app, we will:

  1. Install the required dependencies
  2. Use CLIP to calculate embeddings for each image in a folder
  3. Use CLIP to find related images given a text query (e.g. “people” or “city”)
  4. Write logic to turn related images into a video
  5. Save the slideshow we have generated

You may be wondering: “what are embeddings?” Embeddings are numeric representations of images, text, and other data that you can compare. Embeddings are the key to this project: we can compare text and image embeddings to find images related to the themes for which we want to make memories.
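
To make this concrete, here is a minimal sketch (using NumPy and made-up numbers rather than real CLIP outputs) of comparing two embeddings with cosine similarity, one common measure of how related two embeddings are. Later in this guide, FAISS will handle these comparisons for us.

import numpy as np

# toy four-dimensional embeddings; real CLIP embeddings are much longer (512 values in this guide)
image_embedding = np.array([0.1, 0.8, 0.3, 0.4])
text_embedding = np.array([0.2, 0.7, 0.2, 0.5])

# cosine similarity: values near 1 mean closely related, values near 0 mean unrelated
similarity = np.dot(image_embedding, text_embedding) / (
    np.linalg.norm(image_embedding) * np.linalg.norm(text_embedding)
)

print(similarity)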

Step #1: Install Required Dependencies

Before we can start building our app, we need to install a few dependencies. Run the following command to install the Python packages we will use in our application:

pip install faiss-cpu opencv-python Pillow requests

(If you are working on a computer with a CUDA-enabled GPU, install faiss-gpu instead of faiss-cpu.)

With the required dependencies installed, we are now ready to start building our memories app.

Start by importing the required dependencies for the project:

import base64
import os
from io import BytesIO
import cv2
import faiss
import numpy as np
import requests
from PIL import Image
import json

We are going to use the Roboflow Inference Server for retrieving CLIP embeddings. You can host the Inference Server yourself, but for this guide we will use the hosted version of the server.

Add the following constant variables to your Python script, which we will use later on to query the inference server.

INFERENCE_ENDPOINT = "https://infer.roboflow.com"
API_KEY = "API_KEY"

Replace the `API_KEY` value with your Roboflow API key. Learn how to find your Roboflow API key.
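
As an aside, if you prefer not to hard-code your key, or you want to point the script at a self-hosted Inference Server, you could adjust the constants along these lines. This is only a sketch: it assumes you have set a ROBOFLOW_API_KEY environment variable, and that a self-hosted server would be reachable on its default port (9001).

import os

# point at a self-hosted Inference Server instead of the hosted endpoint (assumed local setup)
# INFERENCE_ENDPOINT = "http://localhost:9001"

# read the API key from an environment variable instead of hard-coding it
API_KEY = os.environ.get("ROBOFLOW_API_KEY", "API_KEY")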

Now, let’s start working on the logic for our application.

Step #2: Calculate Image Embeddings

Our application is going to take a folder of images and a text input. We will then return a slideshow that contains images related to the text input. For this, we need to calculate two types of embeddings:

  1. An image embedding for each image, and
  2. A text embedding for the slideshow theme.

Let’s define a function that calls the Roboflow Inference Server and calculates an image embedding:

def get_image_embedding(image: Image.Image) -> list:
    # convert the image to RGB and encode it as base64 so it can be sent to the Inference Server
    image = image.convert("RGB")

    buffer = BytesIO()
    image.save(buffer, format="JPEG")
    image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")

    payload = {
        "image": {"type": "base64", "value": image_base64},
    }

    # the API key is passed as a query parameter
    data = requests.post(
        INFERENCE_ENDPOINT + "/clip/embed_image?api_key=" + API_KEY, json=payload
    )

    response = data.json()

    embedding = response["embeddings"]

    return embedding

Next, let’s define another function that retrieves a text embedding for a query:

def get_text(prompt: str) -> np.ndarray:
    # retrieve a CLIP text embedding for the prompt from the Inference Server
    text_prompt = requests.post(
        f"{INFERENCE_ENDPOINT}/clip/embed_text?api_key={API_KEY}", json={"text": prompt}
    ).json()["embeddings"]

    # FAISS expects float32 vectors
    return np.array(text_prompt, dtype=np.float32)
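
Before moving on, you can optionally sanity-check both functions. The snippet below assumes you have an image called test.jpg in your working directory; both embeddings should have the same length, which must match the dimension we give our index in the next step (512).

image_embedding = get_image_embedding(Image.open("test.jpg"))
text_embedding = get_text("a photo of a city")

print(len(image_embedding[0]))  # length of the image embedding (512)
print(text_embedding.shape)     # shape of the text embedding, e.g. (1, 512)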

Step #3: Create an Index

The two functions we wrote in the previous step both return embeddings. But we haven’t written the logic to use them yet! Next, we need to calculate image embeddings for a folder of images. We can do this using the following code:

index = faiss.IndexFlatL2(512)
image_frames = []
file_names = []

for file_name in os.listdir("./images"):
    frame = Image.open("./images/" + file_name)

    embedding = get_image_embedding(frame)

    # FAISS expects float32 vectors
    index.add(np.array(embedding, dtype=np.float32))

    image_frames.append(frame)
    file_names.append(file_name)

# save the file name associated with each index entry
with open("image_frames.json", "w+") as f:
    json.dump(file_names, f)

faiss.write_index(index, "index.bin")

This code creates an “index”. This index will store all of our embeddings. We can efficiently search this index using text embeddings to find images for our slideshow.

At the end of this code, we save the index to a file for later use. We also save the image file names to a JSON file. This is important because the index doesn’t store file names, and we need to know which file each entry in the index corresponds to so we can build our slideshow.
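
Saving both files means we don’t have to recompute embeddings every time we make a new slideshow. As a rough sketch, a separate script could reload them like this (assuming it runs in the same directory):

import json

import faiss

# reload the saved index and the file names that map index positions back to images
index = faiss.read_index("index.bin")

with open("image_frames.json") as f:
    file_names = json.load(f)

print(index.ntotal, "images indexed")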

Step #4: Retrieve Images for the Slideshow

Next, we need to retrieve images for our slideshow. We can do this in a few lines of code:

query = get_text("san francisco")
D, I = index.search(query, 3)

In the first line of code, we call the get_text() function we defined earlier to retrieve a text embedding for a query. In this example, our query is “san francisco”. Then, we search our image index for images whose embeddings are similar to our text embedding.

This code will return images ordered by their relevance to the query. If you don’t have any images relevant to the query, results will still be returned, although they will not be useful in creating a thematic slideshow. Thus, make sure you search for themes you know are featured in your images.

The 3 value indicates that we want the top three images associated with our text query. You can increase or decrease this number to retrieve more or fewer images for your slideshow.
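
If you want to inspect the results before building the video, D contains the distances for each match (lower means more similar for this type of index) and I contains the positions of the matching images in the index, which line up with the file names we saved earlier. For example:

for distance, position in zip(D[0], I[0]):
    print(file_names[position], distance)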

Step #5: Find Maximum Image Width and Height

There is one more step we need to complete before we can start creating slideshows: we need to find the largest image width and height values in the images we will use to create each slideshow. This is because we need to know at what resolution we should save our video.

To find the maximum width and height values in the frames we have gathered, we can use the following code:

video_frames = []
largest_width = 0
largest_height = 0

for i in I[0]:
    frame = image_frames[i]
    # convert the PIL image (RGB) to an OpenCV-style BGR array
    cv2_frame = np.array(frame.convert("RGB"))
    cv2_frame = cv2.cvtColor(cv2_frame, cv2.COLOR_RGB2BGR)

    # repeat each frame 20 times so each image shows for one second at 20 FPS
    video_frames.extend([cv2_frame] * 20)

    height, width, _ = cv2_frame.shape

    if width > largest_width:
        largest_width = width

    if height > largest_height:
        largest_height = height

Step #6: Generate the Slideshow

We’re on to the final step: creating the slideshow. All of the pieces are in place. We have found images related to a text query and calculated the resolution we will use for our slideshow. The final step is to create a video from those images.

We can create our slideshow using the following code:

final_frames = []

for frame in video_frames:
    # pad the top and bottom of shorter frames so every frame matches the target height
    if frame.shape[0] < largest_height:
        difference = largest_height - frame.shape[0]
        padding = difference // 2

        frame = cv2.copyMakeBorder(
            frame,
            padding,
            difference - padding,
            0,
            0,
            cv2.BORDER_CONSTANT,
            value=(0, 0, 0),
        )

    # pad the left and right of narrower frames so every frame matches the target width
    if frame.shape[1] < largest_width:
        difference = largest_width - frame.shape[1]
        padding = difference // 2

        frame = cv2.copyMakeBorder(
            frame,
            0,
            0,
            padding,
            difference - padding,
            cv2.BORDER_CONSTANT,
            value=(0, 0, 0),
        )

    final_frames.append(frame)

video = cv2.VideoWriter(
    "video.avi", cv2.VideoWriter_fourcc(*"MJPG"), 20, (largest_width, largest_height)
)

for frame in final_frames:
    video.write(frame)

video.release()

This code builds a list of all of the frames we want to include in our video. Each frame is padded with black pixels up to the maximum height and width we identified earlier, so images are not stretched to match the resolution of the largest image. We then write all of these frames to a video and save the result to a file called video.avi.

Let’s run our code on a folder of images. For this guide, we have run the memories app on a series of city photos. Here is what our video looks like:

[Video: the slideshow generated for the query “san francisco”]

We have successfully generated a video with images related to “san francisco”.

Conclusion

CLIP is a versatile tool with many uses in computer vision. In this guide, we have demonstrated how to build a photo memories app with CLIP. We used CLIP to calculate image embeddings for all images in a folder. We then saved these embeddings in an index.

Next, we used CLIP to calculate a text embedding that we used to find images related to a text query. In our example, this query was “san francisco”. Finally, we completed some post-processing to ensure images were all the same size, and compiled images related to our query into a slideshow.