This article was contributed to the Roboflow blog by Abirami Vina.

Have you ever noticed how some apps automatically suggest filters based on the scenery in your photos? The AI technique behind this is called scene classification. Scene classification is a computer vision task that identifies and categorizes scenes from photographs or video frames. It helps us understand the context of a scene, and it's something you can easily implement in your own applications using a scene classification API.

In this guide, we'll dive into what scene classification is, its applications, and how to use the Roboflow scene classification API. We’ll also walk through a practical example of how to use the scene classification API for weather monitoring. Let’s dive right in!

What is Scene Classification?

Scene classification is a computer vision technique that can sort images or video frames into different classes or categories. The sorting is based on what's in the scene or background, which is why it's called 'scene classification.' Training data for this technique involves large datasets with lots of different real-world scenes, both natural and man-made. 

At its core, scene classification aims to recognize the semantic meaning or context of a scene based on its visual appearance. It involves analyzing the colors, textures, shapes, and spatial arrangements of objects within the scene to make an informed decision about their category. For instance, a scene with trees, grass, and a blue sky might be labeled as a "park," while a scene with tall buildings and busy streets might be labeled as a "cityscape."

Examples of Scene Classification (Source)

The Evolution of Scene Classification

Before the rise of deep learning, researchers built computational models of holistic scene recognition based on the 'Spatial Envelope,' a low-dimensional representation of a scene. This line of work, together with the later creation of large-scale datasets such as Places365, laid the groundwork for deep learning's success in scene recognition and classification research.

Now, thanks to deep learning, particularly Convolutional Neural Networks (CNNs), scene classification has become a major focus of the computer vision community. Deep learning also makes transfer learning possible: you can fine-tune a model pretrained on a large scene dataset for your own task, which is a useful and efficient option when you don't have a ton of data.

Scene Classification vs. Image Classification

Scene classification and image classification are similar but tackle different tasks. Image classification focuses on the main object in a picture: if there's a big cat in a photo, image classification would label the image as a cat. Scene classification, on the other hand, looks at the whole picture to figure out where it was taken or what's going on. A photo of a cat chilling on a beach would be classified as a beach because of the sand, palm trees, and ocean, even though the cat is the main object.

Knowing the difference matters when deciding which approach fits your computer vision requirement. If you want to sort pictures by what's in them, use image classification. But if you want to group them by where they were taken or what's happening around the main object, scene classification may be the better option.

The Roboflow Scene Classification API

Now that we have an idea of what scene classification is, let’s understand how the Roboflow Scene Classification API can help you integrate this capability into your projects.

The Roboflow Scene Classification API lets anyone integrate scene classification capabilities into their applications or systems. These are the steps involved when you invoke this API:

  • Input: The Roboflow API accepts input data in the form of images or video frames.
  • Feature Extraction: The API uses a zero-shot classification model called CLIP to extract features from the input images or frames. These features capture important visual characteristics of the scenes, such as colors, textures, and shapes. 
  • Classification: The extracted features are then used for scene classification. The CLIP model uses these features to predict the scene category for each input image or frame. Developers predetermine the scene categories in the form of 'prompts'.
  • Output: The API returns the predicted scene category or categories for each input image or frame. This output will include the top N predicted categories along with their confidence scores.
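To make that output step concrete, here is a minimal, self-contained sketch (plain Python, no API call) of how a list of per-prompt confidence scores can be turned into top-N predictions. The function name and the sample scores are illustrative, not the exact API schema:

```python
def top_n_predictions(prompts, similarities, n=3):
    """Pair each prompt with its score and return the n highest-confidence pairs."""
    ranked = sorted(zip(prompts, similarities), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

labels = ["park", "cityscape", "beach", "forest"]
scores = [0.91, 0.05, 0.03, 0.01]
print(top_n_predictions(labels, scores, n=2))
# [('park', 0.91), ('cityscape', 0.05)]
```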

How Can the CLIP Model Be Used for Scene Classification?

CLIP, which stands for Contrastive Language-Image Pretraining, is a model created by OpenAI. It can process more than just images. The CLIP model can learn from both pictures and the words people use to describe them. While other models require numerous labeled pictures to learn, CLIP has been trained on countless picture-word pairs found online.

This capability enables it to understand the connections between words and pictures. Consequently, CLIP can perform various tasks, including image classification, image captioning, and even zero-shot learning, where it can classify images it hasn't seen during training solely by comprehending their textual descriptions. 

Simply put, we can provide the CLIP model with images it has never encountered before and expect it to classify them based on specific text inputs (classes) that we provide. Before we see this in action, let’s explore different applications where scene classification may come in handy.
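Conceptually, CLIP embeds both the image and each text prompt into the same vector space and picks the prompt whose embedding is most similar to the image's. Here is a toy sketch of that comparison using made-up 3-dimensional embeddings (real CLIP embeddings have hundreds of dimensions, and the vectors below are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

image_embedding = [0.9, 0.1, 0.2]  # made-up embedding of a "beach photo"
text_embeddings = {
    "a photo of a beach":  [0.8, 0.2, 0.1],
    "a photo of a forest": [0.1, 0.9, 0.3],
}

# The predicted class is the prompt most similar to the image embedding.
best = max(text_embeddings, key=lambda t: cosine_similarity(image_embedding, text_embeddings[t]))
print(best)  # a photo of a beach
```

This is exactly the mechanism the API's "compare" endpoint exposes: it returns a similarity score per prompt, and the highest-scoring prompt is taken as the predicted scene.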

Scene Classification Use Cases

Let's explore three common use cases of scene classification: content moderation, sports video summarization, and climate and weather monitoring.

Content Moderation Using Scene Classification

Scene classification can help keep platforms like YouTube a safe and appropriate place for everyone. By automatically detecting what kind of setting or environment is shown in videos, these algorithms can identify content that breaks the rules. 

For example, if a video is classified as having violent scenes or adult content, it can be automatically flagged to be reviewed by human moderators or even removed entirely. This helps stop harmful or offensive content from spreading and protects users from seeing inappropriate content they shouldn't see. Scene classification technology helps major platforms like YouTube maintain a safe, positive user experience by using these algorithms to perform an initial screening pass.

Sports Video Summarization Using Scene Classification

Scene classification is a game-changer for sports video summarization, allowing fans to relive the most exciting moments without having to watch the entire match. In cricket, for example, scene classification algorithms can automatically identify key events like boundaries, wickets, and player celebrations by analyzing the visual cues within the video frames.

By recognizing the specific scenes associated with these exciting moments, AI can automatically generate highlight reels that capture the essence of the match. Moreover, scene classification can be used to create personalized highlight reels based on individual preferences, such as focusing on a specific player or type of event. It truly enhances the viewing experience for sports enthusiasts.

Climate and Weather Monitoring Using Scene Classification

Another great application of scene classification is climate and weather monitoring. It involves analyzing images and video frames captured by high-quality cameras to monitor changes in weather and climatic conditions. This application has plenty of use cases, such as weather phenomena detection, vegetation and land cover analysis, and long-term climate change studies. The data and information gathered through this can help scientists and researchers understand the impact of climate change and plan for future mitigation efforts. In the next section, we’ll go through a simple code example that uses scene classification to identify and classify seasonal changes. 

How to Use Scene Classification on a Video

Let's walk through a code example that implements scene classification to predict seasonal changes in a video by analyzing it frame by frame and displaying the output on the frames. If the weather indicates a snowy day with gloomy skies, the code will display 'Winter Season' and similarly, for other seasons. To implement this, we'll use the Roboflow Scene Classification API.

We’ll run our inferences on a few different input videos (Video 1, Video 2, Video 3) from the internet. You can use the same videos or any video you’d like to test!

Step 1: Install

Install all the needed dependencies as shown below.

pip install roboflow opencv-contrib-python

Step 2: Import and Initialize

Import all the needed libraries and initialize the hosted Roboflow Inference endpoint server, your Roboflow API Key, and the path to your video file.

import requests
import base64
from PIL import Image, ImageDraw, ImageFont
from io import BytesIO
import cv2
import numpy as np

# API endpoint for inference
INFERENCE_ENDPOINT = "https://infer.roboflow.com"
# Your API key
API_KEY = "Your API Key"
# Input video file
VIDEO = "Path/To/Video file"

Step 3: Initialize Your Classes 

Initialize the prompt with the names of seasons. This is a list of categories into which each video frame will be classified.

# List of prompts for season classification
prompts = [
    "Winter Season",
    "Autumn Season",
    "Spring Season",
    "something else"
]
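Prompt wording matters with CLIP: more descriptive phrases often separate classes better than bare season names. The variant below is a hypothetical alternative worth experimenting with on your own footage, mapping each descriptive prompt back to the short label you want to display:

```python
# Hypothetical, more descriptive prompts for the same four classes.
descriptive_prompts = [
    "a snowy winter landscape with bare trees",
    "an autumn scene with orange and brown foliage",
    "a spring scene with green grass and blossoms",
    "something else",
]

# Map each descriptive prompt back to a short display label.
prompt_to_label = dict(zip(
    descriptive_prompts,
    ["Winter Season", "Autumn Season", "Spring Season", "something else"],
))
```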

Step 4: Create a Function

In this step, we will create a function to encode the video frame in the proper format, create a payload for the API request, post the request to the Roboflow Web API endpoint, and return the predictions that we get back. The predictions will be the categories (prompts) we initialized in Step 3.

# Function to classify a video frame using Roboflow's API
def classify_image(image: np.ndarray) -> str:
    # Convert the image array to a PIL Image object
    image_data = Image.fromarray(image)
    buffer = BytesIO()
    # Save the image data to a buffer in JPEG format
    image_data.save(buffer, format="JPEG")
    # Encode the image data to base64
    image_data = base64.b64encode(buffer.getvalue()).decode("utf-8")

    # Create a payload for the API request
    payload = {
        "api_key": API_KEY,
        "subject": {
            "type": "base64",
            "value": image_data
        },
        "prompt": prompts,
    }

    # Make a POST request to the API endpoint
    data = requests.post(INFERENCE_ENDPOINT + "/clip/compare?api_key=" + API_KEY, json=payload)
    # Parse the response JSON
    response = data.json()

    # Get the index of the prompt with the highest prediction
    highest_prediction_index = response["similarity"].index(max(response["similarity"]))
    # Return the corresponding prompt
    return prompts[highest_prediction_index]

Step 5: Open the Input Video

Open the input video file using the path we initialized earlier.

# Open the input video file
cap = cv2.VideoCapture(VIDEO)

Step 6: Inference and Visualize

Here, we will read the video frame by frame and send those frames to the function we made in Step 4. Then, the classified season output will be displayed on the frame.

while cap.isOpened():
    # Read a frame from the video
    ret, frame = cap.read()
    if not ret:
        break

    # Resize the frame to 60% of its original size (play around with values).
    frame = cv2.resize(frame, None, fx=0.6, fy=0.6)

    # Classify the frame using the classify_image function
    label = classify_image(frame)

    # Convert the frame to RGB (PIL format)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(frame)
    draw = ImageDraw.Draw(pil_image)
    # "arial.ttf" is available on Windows; on Linux/macOS, point this at a
    # font file on your system or fall back to ImageFont.load_default().
    font = ImageFont.truetype("arial.ttf", 30)

    # Add the detected season label to the frame (play around with the values to get the right color)
    draw.text((10, 10), label, font=font, fill=(255, 0, 0))

    # Convert the frame back to BGR (OpenCV format)
    frame = cv2.cvtColor(np.array(pil_image), cv2.COLOR_RGB2BGR)

    # Display the frame
    cv2.imshow('Frame', frame)

    # Wait for the 'q' key to be pressed to exit the loop
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
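Calling the API on every frame of a 30 fps video is slow and costly. A common refinement, not shown in the walkthrough above, is to classify only every Nth frame and reuse the last label in between. The helper below is a self-contained sketch of that idea; the class name and the fake classifier are illustrative:

```python
class ThrottledClassifier:
    """Call the real classifier only every `stride`-th frame; reuse the last label otherwise."""

    def __init__(self, classify_fn, stride=15):
        self.classify_fn = classify_fn
        self.stride = stride
        self.last_label = None
        self.frame_index = 0

    def label_for(self, frame):
        # Re-classify on the first frame and then every `stride` frames.
        if self.frame_index % self.stride == 0 or self.last_label is None:
            self.last_label = self.classify_fn(frame)
        self.frame_index += 1
        return self.last_label

# Toy demo: a fake classifier that records how often it is actually called.
calls = []
def fake_classify(frame):
    calls.append(frame)
    return "Winter Season"

clf = ThrottledClassifier(fake_classify, stride=5)
labels = [clf.label_for(i) for i in range(12)]
print(len(calls))  # 3  (classified frames 0, 5, and 10 only)
```

In the loop from Step 6, you would replace the direct `classify_image(frame)` call with `clf.label_for(frame)` to cut API traffic roughly by a factor of the stride.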

The Output

Here are the outputs of the three different videos we tested.

Scene Classification Challenges

Achieving real-time performance for scene classification tasks is tough—it requires powerful computing and efficient coding to process visual data quickly and ensure a smooth user experience without delays. Protecting sensitive visual data is crucial, too, calling for strong security measures and responsible data handling to prevent unauthorized access. Plus, staying adaptable is key. Continuously updating models to recognize new scenes and trends helps maintain accuracy over time.

Conclusion

Scene classification can help identify the context and setting of images and videos. It has many different uses, like keeping inappropriate content off platforms or identifying key moments in sports games. We looked at what scene classification is, the different ways it can be used, and some challenges, like getting it to run smoothly in real time and protecting people's privacy in visual data. By integrating a reliable scene classification tool, such as Roboflow's API built on OpenAI's CLIP model, into your projects, you can take advantage of these powerful capabilities.

Fuel Your Curiosity

Here are some more resources to satisfy your curiosity and continue your learning journey: