Ultimate Guide to Albumentations
Imagine you're training a computer vision model to detect animals in wildlife images. Your dataset looks promising: zebras in the savannah, lions resting under trees, and elephants marching in the dust. But when you test the model on a blurry photo of a zebra taken at dusk, it completely fails. Why? Because your model only saw clean, well-lit images during training.
This is where Albumentations steps in, like a seasoned wildlife photographer: adding noise, changing brightness, flipping angles, and even cropping scenes so your model learns to survive in the wild, not just in the lab.
In this guide, we'll explore how Albumentations turns plain datasets into powerful training goldmines, covering everything from simple image flips to complex 3D medical scans. And just wait until you see how it augments videos consistently across frames, a game-changer for video detection tasks. In this blog we will cover:
- Overview of Albumentations and its role in computer vision
- Why image augmentation improves model performance
- How to install and start using Albumentations
- Using pixel, spatial, and 3D transforms for augmentation
- Building and customizing augmentation pipelines
- Practical examples for detection, segmentation, pose, and video
- Benchmarking Albumentations vs. other libraries
- Fast, no-code augmentation with Roboflow
- Tips to speed up and debug augmentation pipelines
Let’s dive in.
What Is Albumentations?
Albumentations is a fast, flexible, and user-friendly image augmentation library for computer vision tasks. It was designed to work seamlessly with deep learning frameworks like PyTorch, TensorFlow, and Keras, and is widely used in computer vision tasks such as classification, object detection, segmentation, and keypoint estimation.
The main goal of Albumentations is to increase the diversity of training data without actually collecting more data, thereby improving model generalization. It achieves this by applying various transformations (called augmentations) to images and their corresponding labels (bounding boxes, masks, keypoints etc.) in a way that maintains consistency and accuracy.
What Is Image Augmentation in Computer Vision?
In the context of computer vision and deep learning, image augmentation refers to the process of creating modified versions of images in the training dataset using various transformations. These transformations simulate the kinds of variations a model might encounter in the real world like changes in lighting, angle, scale, or background.
The goal of image augmentation is to:
- Increase the diversity of training data
- Prevent overfitting
- Improve model generalization
Why Do We Need Augmentation?
Imagine training a model to detect cats, but your dataset only has images of cats sitting upright in well-lit rooms. If the model is then shown an image of a cat lying down in dim lighting, it might fail to recognize it.
By adding training images with dim lighting or darker conditions, we help the model learn to recognize important features, like a cat's face or body, even if the cat looks different or appears in different places under such lighting conditions.
Getting Started with Albumentations
Let's see the Albumentations library in action.
Installation
Before you can use the Albumentations library, you need to install it. Installation is easy. You can install the latest version from PyPI with the following command.
pip install -U albumentations
If you are using Anaconda or Miniconda, you can install it with the following command.
conda install -c conda-forge albumentations
Alternatively, you can install the latest version from GitHub. The following command installs the bleeding-edge version directly from the main branch.
pip install -U git+https://github.com/albumentations-team/albumentations
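To confirm the installation, you can print the installed version (a quick sanity check; the exact version number will vary on your machine).
import albumentations as A

print(A.__version__)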
Types of Augmentations in Albumentations
Albumentations offers various types of image augmentations, grouped into the following main categories.
- Pixel-level transforms
- Spatial-level transforms
- 3D Transforms
The code examples discussed in this article are available in full within the companion notebook. I recommend following along there for hands-on experimentation with all examples.
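If you want to run the snippets standalone instead, they assume a minimal setup along the following lines (the file name is only a placeholder; any RGB image works).
import cv2
import albumentations as A

# Read an image with OpenCV and convert BGR -> RGB, since Albumentations expects RGB arrays
img = cv2.imread("raccoon.png")  # placeholder path; substitute your own image
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)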
Pixel-Level Transforms
In Albumentations, pixel-level transforms are a type of image augmentation that modifies only the pixel values of an image, leaving associated targets like masks, bounding boxes, or keypoints unchanged. These transforms focus on altering the image's appearance without affecting its spatial structure or the geometric relationships within the image.
Key characteristics of pixel-level transforms:
- Image-only Modifications: They primarily affect the pixel values, such as brightness, contrast, color, or adding noise.
- No Geometric Changes: They do not involve geometric transformations like rotation, scaling, or shearing.
- Target Independence: Because they don't alter spatial relationships, these transforms don't require updating any associated targets like masks, bounding boxes, or keypoints.
Examples of pixel-level transforms in Albumentations:
Blur: Applies a simple box filter to the image using OpenCV's cv2.blur, with the blur intensity controlled by a randomly selected kernel size within the specified range. A larger kernel produces a stronger blur effect.
transform = A.Blur(
blur_limit=(7, 13), # optional. Use larger kernel for stronger blur
p=1.0)
aug = transform(image=img)["image"]
You should see output similar to the following.
ChannelShuffle: Randomly shuffles (rearranges) the channels ([R, G, B]) of the image; e.g., [R, G, B] may become [G, B, R] or [B, R, G]. This results in a color shift but preserves the original image data.
transform = A.ChannelShuffle(p=1.0)
aug = transform(image=img)["image"]
You should see output similar to the following.
CLAHE: Applies Contrast Limited Adaptive Histogram Equalization to improve local contrast by dividing the image into small tiles and enhancing each region independently. The clip_limit controls how much contrast is boosted, while tile_grid_size sets the number of tiles used for localized adjustments.
transform = A.CLAHE(
clip_limit=4.0, # Single value or range [1, clip_limit] to control the contrast enhancement.
tile_grid_size=(8, 8), # number of tiles in the row and column directions.
p=1.0)
aug = transform(image=img)["image"]
You should see output similar to the following.
GaussianBlur: Applies a Gaussian filter to the image using a randomly chosen kernel size and sigma value, creating a smooth, soft-focus effect by reducing image noise and fine details. The blur intensity is controlled by blur_limit (kernel size) and sigma_limit (spread of the blur).
transform = A.GaussianBlur(
sigma_limit=(3.0, 7.0), # small sigma, e.g. (0.2, 0.5), gives subtle blur; larger, e.g. (3.0, 7.0), gives stronger blur
blur_limit=(9, 9), # (n, n) for a fixed kernel size, 0 to auto-compute the kernel size
p=1.0)
aug = transform(image=img)["image"]
You should see output similar to the following.
GaussNoise: Adds random Gaussian noise to the image, with noise strength controlled by std_range (standard deviation) and mean_range, expressed as fractions of the image's max value. The noise can vary per channel and be scaled spatially for speed using noise_scale_factor.
Example using the noise standard deviation as the factor.
transform = A.GaussNoise(
std_range=(0.3, 0.6), # range between [0, 1] for noise standard deviation as a fraction
p=1.0
)
aug = transform(image=img)["image"]
The following is the output.
Example using the noise mean as the factor.
transform = A.GaussNoise(
mean_range=(0.3, 0.6), # range between [0, 1] for noise mean as a fraction
p=1.0
)
aug = transform(image=img)["image"]
The output is similar to the following.
HueSaturationValue: Randomly adjusts the hue, saturation, and brightness (value) of an image by modifying its HSV color space representation. The amount of change for each channel is controlled by hue_shift_limit, sat_shift_limit, and val_shift_limit, allowing varied color and lighting effects.
transform = A.HueSaturationValue(
hue_shift_limit=20, # single value or range between [-180, 180] to change hue
sat_shift_limit=30, # single value or range between [-255, 255] to change saturation
val_shift_limit=20, # single value or range between [-255, 255] to change brightness (value)
p=1.0 # Probability of applying the transform.
)
aug = transform(image=img)["image"]
Here's the output.
Normalize: It changes the pixel values of an image to a standard scale. Most images have pixel values in the range [0, 255]. But this range can vary in brightness, contrast, and color distribution. Normalization transforms these values into a new, more stable range like [0, 1] or centered around 0. Albumentations supports several normalization methods, controlled by the `normalization` parameter.
- standard (default): Uses the provided mean and std (e.g., ImageNet values) in the formula (img - mean * max_pixel_value) / (std * max_pixel_value).
- image: Computes a global mean and std from the image itself (all channels together).
- image_per_channel: Computes mean and std separately for each channel (R, G, B) and adjusts each channel independently.
- min_max: Scales all pixels to the [0, 1] range using the global minimum and maximum pixel values.
- min_max_per_channel: Scales each channel separately to [0, 1].
transform = A.Normalize(
mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225),
max_pixel_value=255.0,
p=1.0)
aug = transform(image=img)["image"]
The following will be the output.
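If you would rather derive statistics from the image itself instead of supplying ImageNet values, you can switch the normalization mode (this assumes an Albumentations version that exposes the normalization parameter); a minimal sketch using min_max scaling:
transform = A.Normalize(
    normalization="min_max",  # scale pixels to [0, 1] using the image's own min and max
    p=1.0)
aug = transform(image=img)["image"]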
RandomBrightnessContrast: Randomly modifies the brightness and contrast of an image by scaling and shifting pixel values. The brightness_limit controls how much lighter or darker the image becomes, and the contrast_limit adjusts how strong or flat the contrast is. The effect can be based on the maximum pixel value or the mean, depending on brightness_by_max.
transform = A.RandomBrightnessContrast(
brightness_limit=0.3, # Factor for changing brightness, single value or range [-1.0, 1.0]
contrast_limit=0.3, # Factor for changing contrast, single value or range [-1.0, 1.0]
p=1.0)
aug = transform(image=img)["image"]
The following will be the output.
Sharpen: Enhances the edges and fine details in an image using either a kernel-based method (Laplacian operator) or a Gaussian interpolation method. The alpha parameter controls the strength of sharpening, while lightness adjusts brightness (only in the kernel method). You can choose between kernel for more intense sharpening or gaussian for a smoother, more natural effect.
Example of the kernel-based method:
transform = A.Sharpen(alpha=(0.2, 0.5),
lightness=(0.8, 1.0),
method='kernel',
p=1.0)
aug = transform(image=img)["image"]
The following will be the output.
Example of the Gaussian interpolation method:
transform = A.Sharpen(alpha=(0.5, 1.0),
method='gaussian',
kernel_size=5,
sigma=1.0,
p=1.0)
aug = transform(image=img)["image"]
The following will be the output.
ToGray: Converts a color image to grayscale using one of several methods:
- weighted_average: Converts using a weighted sum of R, G, B based on human brightness perception.
- from_lab: Extracts the lightness channel from the LAB color space.
- desaturation: Averages the max and min pixel values across channels.
- average: Computes a simple average of all color channels.
- max: Uses the maximum value among all channels for each pixel.
- pca: Applies PCA to reduce the image to grayscale based on dominant variance.
The method parameter controls how the grayscale is computed, ranging from fast approximations to perceptually accurate conversions. If num_output_channels is greater than 1, the grayscale output is duplicated across multiple channels (e.g., 3-channel grayscale).
Why replicate the channel?
While a true grayscale image only needs one channel, you might replicate the channel for compatibility reasons or use the output with algorithms or models that expect a certain number of input channels (like 3 channels for an RGB image). In such cases, having the same grayscale data in multiple channels ensures the data fits the expected input format without adding extra information. So, to avoid shape mismatch errors, ToGray replicates the grayscale channel to produce a 3-channel grayscale image by default.
Example of grayscale conversion using the from_lab method:
transform = A.ToGray(
method="desaturation", # method used for grayscale conversion
p=1.0
)
aug = transform(image=img)["image"]
You should see output similar to the following.
The following is an example of grayscale conversion using the desaturation method with the number of output channels set to 1.
transform = A.ToGray(
method="desaturation", # method used for grayscale conversion
num_output_channels=1, # number of output channels. Default is 3
p=1.0
)
aug = transform(image=img)["image"]
You should see output similar to the following.
Spatial-Level Transforms
In Albumentations, spatial-level transforms are augmentations that change the geometric structure, such as shape, position, or orientation, of an image and its related targets, such as masks, bounding boxes, or keypoints. These transforms are useful for helping models learn to handle different object placements and perspectives. They affect both the input image and any associated targets to ensure consistency between the image and its annotations.
Key characteristics of spatial-level transforms:
- Geometry Transformation: These transforms modify the layout of the image, such as flipping, rotating, or shifting it.
- Target Association: Since the image's structure is changed, any associated targets (like masks or bounding boxes) are also transformed in the same way to stay accurate.
- Error Sensitive: If a spatial transform is applied to an unsupported target type, it may raise an error, so you must ensure compatibility.
- Usability: Spatial transforms are especially important for object detection, segmentation, or keypoint tasks where the position and shape of objects matter.
- Integration: Spatial transforms can be used together with pixel-level transforms to both change the geometry and appearance of images for stronger data augmentation.
Note: The following are basic examples of spatial-level transforms in Albumentations without any targets (bounding boxes, masks, keypoints). To augment images along with their targets (such as bounding boxes and keypoints), the Albumentations pipeline (albumentations.Compose()) is used, because it lets you specify additional information through parameters like bbox_params and keypoint_params, which are required when augmenting images with targets. We will discuss how to use pipelines and see examples of augmenting an image along with its targets in a later section. In this section we focus only on augmenting images without their targets (i.e., augmenting for training an image classification model).
Affine: Applies a combination of geometric transformations to an image, including:
- Translation: moves the image left/right or up/down
- Rotation: spins the image around its center
- Scaling: zooms in or out
- Shearing: slants the image like turning a rectangle into a parallelogram
These transformations help simulate real-world camera effects and are commonly used for data augmentation in computer vision.
The key parameters of Affine are:
- scale: Zooms the image in or out; accepts a float, tuple, or dict for independent x/y scaling.
- rotate: Rotates the image by degrees around the center.
- rotate_method: Determines how rotated bounding boxes are handled.
- shear: Slants the image (like tilting a square into a trapezoid).
- translate_percent: Moves the image by a percentage of its width/height.
- translate_px: Same as translate_percent, but values are in pixels instead of percentages.
- fill: Value for filling image borders (e.g., 0 = black).
- fill_mask: Fill value for segmentation masks.
- border_mode: OpenCV strategy for filling borders.
- fit_output: Resizes the output canvas so no part of the image is cut off after rotation.
- keep_ratio: Maintains the aspect ratio during scaling.
- interpolation: Uses OpenCV interpolation flags to set how pixel values are computed during transformation.
- mask_interpolation: Same as interpolation, but used for masks.
- balanced_scale: Ensures an equal chance of zooming in and zooming out.
- p: Probability of applying the transform.
transform = A.Affine(
scale=(0.8, 1.2),
rotate=(-30, 30),
shear={"x": (-10, 10), "y": (-5, 5)},
translate_percent={"x": (-0.1, 0.1), "y": (-0.1, 0.1)},
interpolation=1, # cv2.INTER_LINEAR
fit_output=True,
fill=0,
p=1.0
)
aug = transform(image=img)["image"]
The output of the code will be similar to the following.
Crop: Extracts a specific rectangular region from an image (and optionally its mask, bounding boxes, and keypoints) by specifying the corner coordinates.
- x_min, y_min: Top-left corner of the crop box.
- x_max, y_max: Bottom-right corner of the crop box.
- pad_if_needed: If True, pads the image instead of throwing an error when the crop size exceeds the image.
- pad_position: Where to add padding if needed (center, top-left, random, etc.).
- border_mode: OpenCV method for handling padding borders (e.g., constant color, reflect, replicate).
- fill: Color/value used for padding the image when the border mode is constant.
- fill_mask: Value used for padding the mask, if present.
- p: Probability of applying the crop; usually set to 1.0 to always apply.
transform = A.Crop(
x_min=310,
y_min=210,
x_max=790,
y_max=710,
p=1
)
aug = transform(image=img)["image"]
The following is the output.
HorizontalFlip: Flips the image left to right (mirror image) around the vertical (y) axis.
transform = A.HorizontalFlip(
p=1.0 # Always apply for this example
)
aug = transform(image=img)["image"]
Here's the output.
VerticalFlip: Flips the image top to bottom (upside down) around the horizontal (x) axis.
transform = A.VerticalFlip(
p=1.0 # Always apply for this example
)
aug = transform(image=img)["image"]
Here's the output.
Rotate: Randomly rotates the image by an angle picked from a specified range. It can also rotate masks, bounding boxes, and keypoints accordingly.
Key parameters used are:
- limit: Range of rotation angles in degrees. (-90, 90) means the image can rotate randomly between -90° and 90°; a single float like 30 means (-30, 30).
- interpolation: Method used to estimate pixel values after rotation (e.g., cv2.INTER_LINEAR for smooth results).
- border_mode: Defines how to fill areas that fall outside the original image after rotation (e.g., cv2.BORDER_CONSTANT, cv2.BORDER_REFLECT).
- rotate_method: How to rotate bounding boxes. largest_box fits the rotated object into the smallest upright box; ellipse uses ellipse fitting (good for object tracking).
- crop_border: If True, removes the outer border that might appear after rotation. The output size may change.
- mask_interpolation: Like interpolation, but for masks (usually cv2.INTER_NEAREST).
- fill: Pixel value used for empty areas after rotation if border_mode=cv2.BORDER_CONSTANT.
- fill_mask: Fill value for empty mask areas (if applicable).
- p: Probability of applying the rotation. Default is 0.5.
transform = A.Rotate(
limit=45, # rotate randomly between -45° and 45°
interpolation=1, # cv2.INTER_LINEAR
border_mode=0, # cv2.BORDER_CONSTANT
fill=0, # fill empty space with black
p=1.0
)
aug = transform(image=img)["image"]
The output is similar to the following.
Resize: Resizes the input image (and optional mask, bounding boxes, etc.) to a fixed width and height, regardless of the original aspect ratio.
Key parameters used are:
- height: Target height in pixels for the output image.
- width: Target width in pixels for the output image.
- interpolation: Algorithm used to resize the image, given as an OpenCV interpolation flag (e.g., cv2.INTER_LINEAR, cv2.INTER_AREA).
- mask_interpolation: Same as interpolation, but for masks.
- area_for_downscale: Automatically switches to cv2.INTER_AREA when shrinking images. Use image to apply it only to images, or image_mask to apply it to both image and mask.
- p: Probability of applying the transform (default 1.0, always apply).
transform = A.Resize(
height=224,
width=224,
interpolation=cv2.INTER_LINEAR,
mask_interpolation=cv2.INTER_NEAREST,
area_for_downscale="image",
p=1.0
)
aug = transform(image=img)["image"]
The output is similar to the following.
RandomScale: Randomly resizes the image by scaling it up or down using a randomly chosen scale factor. Unlike Resize, it does not keep the original size, so the output dimensions can vary.
Key parameters used are:
- scale_limit: Range of scale factors to apply. If scale_limit = 0.1, the scale is randomly picked from [0.9, 1.1]; a tuple like (-0.2, 0.3) samples from [0.8, 1.3].
- interpolation: Method used to compute resized pixel values, given as an OpenCV interpolation flag (default cv2.INTER_LINEAR).
- mask_interpolation: Same as above but applied to masks (e.g., for segmentation tasks). Default is cv2.INTER_NEAREST.
- area_for_downscale: Can be image or image_mask; switches to cv2.INTER_AREA when shrinking, which gives better quality for downscaling.
- p: Probability of applying the transform. Default: 0.5.
transform = A.RandomScale(
scale_limit=0.2, # scale randomly between 0.8x and 1.2x
interpolation=cv2.INTER_LINEAR,
mask_interpolation=cv2.INTER_NEAREST,
area_for_downscale="image",
p=1.0
)
aug = transform(image=img)["image"]
Here's the output.
3D Transforms
In Albumentations, 3D transforms are a special type of augmentation designed for volumetric data, such as CT scans, MRI images, or any 3D medical imaging. Unlike regular 2D augmentations that work on flat images (height × width), 3D transforms work on three-dimensional data which includes depth as an additional dimension.
Key characteristics of 3D transforms:
- 3D Data: These transforms handle data in the form of (D, H, W) or (D, H, W, C), where:
- D = depth (number of slices or layers)
- H = height
- W = width
- C = channels (e.g., 1 for grayscale, 3 for RGB)
- Target Association: When applying these transforms, both the input 3D volume and its associated 3D mask (used for segmentation) or keypoints are transformed together to stay aligned.
- Dimension Consistency: Albumentations ensures that the same transformation (like cropping or flipping) is applied uniformly across depth, height, and width. This is important to maintain the spatial and anatomical structure in medical data.
- Usability: 3D transforms are mainly used in medical image analysis, where preserving 3D structure is crucial for tasks like tumor segmentation or organ localization.
- Mask3D Format: The corresponding segmentation labels should be in the shape (D, H, W), where each slice in the volume has a matching mask.
The following are examples of 3D transforms in Albumentations.
For these examples, you need 3D volume data, which you can download from the BraTS2020 dataset.
CenterCrop3D: Crops a 3D volume (like a medical scan or video clip) from its center to a specified depth, height, and width. It extracts a central cube or cuboid from a 3D volume. This is similar to 2D center cropping but extended to 3D data with dimensions:
- Depth (number of slices or frames),
- Height (image rows),
- Width (image columns).
Key parameters used are:
- size: Desired output crop size as (depth, height, width). Required.
- pad_if_needed: If True, pads the volume with a constant value if it's smaller than the crop size.
- fill: Pixel value used to pad the image if padding is applied. Default: 0.
- fill_mask: Value used to pad the mask if padding is needed. Default: 0.
- p: Probability of applying the transform. Default: 1.0.
import albumentations as A
import nibabel as nib
import numpy as np

# Load 3D volume and mask
volume = nib.load("/content/BraTS20_Training_001_flair.nii").get_fdata().astype(np.float32)
mask = nib.load("/content/BraTS20_Training_001_seg.nii").get_fdata().astype(np.uint8)
# Define the CenterCrop3D transform with mask support
transform = A.CenterCrop3D(
size=(64, 128, 128), # (depth, height, width)
pad_if_needed=True,
fill=0,
fill_mask=0,
p=1.0
)
# Apply the transform to both volume and mask
augmented = transform(volume=volume, mask3d=mask)
vol_aug = augmented["volume"]
mask_aug = augmented["mask3d"]
The output looks like the following.
CoarseDropout3D: Randomly removes (sets to a constant value) several cuboid-shaped regions within a 3D volume (and optionally its mask). This simulates occlusion, sensor noise, or missing regions that are common in real-world 3D scans.
Key parameters used are:
- num_holes_range: Minimum and maximum number of cuboids to drop in each volume. Example: (2, 5) drops 2 to 5 cuboids randomly.
- hole_depth_range: Depth (z-dimension) of each dropout cuboid as a fraction of volume depth.
- hole_height_range: Height (y-dimension) of each dropout cuboid as a fraction of volume height.
- hole_width_range: Width (x-dimension) of each dropout cuboid as a fraction of volume width.
- fill: Value to fill dropped-out voxels. Can be a single number or a tuple for multi-channel data.
- fill_mask: Value to fill corresponding regions in the mask. If None, the mask stays untouched.
- p: Probability of applying this transform to a volume.
volume = nib.load("/content/BraTS20_Training_001_flair.nii").get_fdata().astype(np.float32)
mask = nib.load("/content/BraTS20_Training_001_seg.nii").get_fdata().astype(np.uint8)
# Define the CoarseDropout3D transform
transform = A.CoarseDropout3D(
num_holes_range=(2, 6),
hole_depth_range=(0.1, 0.2),
hole_height_range=(0.1, 0.2),
hole_width_range=(0.1, 0.2),
p=1.0
)
# Apply the transform to volume and mask
augmented = transform(volume=volume, mask3d=mask)
vol_aug = augmented["volume"]
mask_aug = augmented["mask3d"]
The following will be the output.
Pad3D: Adds padding to 3D volumes, such as medical scans, along the depth, height, and width dimensions. Padding adds extra voxels (3D pixels) on the sides of a volume and is often used to ensure a consistent shape across all samples or to prevent cropping of features during convolution.
Key parameters used are:
- padding: Specifies how much to pad each dimension. An int pads all 6 sides (front/back, top/bottom, left/right) equally; a tuple[int, int, int] pads symmetrically on both sides of each dimension (depth, height, width); a tuple[int, int, int, int, int, int] allows different padding values for each side explicitly, e.g. (depth_front, depth_back, height_top, height_bottom, width_left, width_right).
- fill (float or tuple): The value used to fill the padded voxels in the image volume. Default is 0.
- fill_mask (float or tuple): Fill value used for the mask (if provided). Default is 0.
- p (float): Probability of applying the padding. Default is 1.0 (always apply).
# Load 3D volume and mask
volume = nib.load("/content/BraTS20_Training_001_flair.nii").get_fdata().astype(np.float32)
mask = nib.load("/content/BraTS20_Training_001_seg.nii").get_fdata().astype(np.uint8)
# Apply Pad3D transformation
transform = A.Pad3D(
padding=(5, 5, 5), # Pad depth, height, width by 5 on each side
fill=225,
fill_mask=1,
p=1.0
)
augmented = transform(volume=volume, mask3d=mask)
vol_aug = augmented["volume"]
mask_aug = augmented["mask3d"]
The following is the output.
RandomCrop3D: Randomly crops a sub-volume of the specified (depth, height, width) from a 3D image. If the original volume is smaller than the crop size, it can optionally pad the volume before cropping.
Key parameters used are:
- size (tuple[int, int, int]): The output crop size specified as (depth, height, width).
- pad_if_needed (bool): Default is False; if set to True, the volume will be padded (using fill or fill_mask) when it's smaller than the requested crop size.
- fill (float or tuple): Default 0; the value used to pad the image if needed.
- fill_mask (float or tuple): Default 0; the value used to pad the mask if needed.
- p (float): Default 1.0; the probability that this transform will be applied.
# Load 3D volume and mask
volume = nib.load("BraTS20_Training_001_flair.nii").get_fdata().astype(np.float32)
mask = nib.load("BraTS20_Training_001_seg.nii").get_fdata().astype(np.uint8)
# Define transform
transform = A.RandomCrop3D(
size=(16, 128, 128),
pad_if_needed=True,
fill=0,
fill_mask=0,
p=1.0
)
# Apply transform
augmented = transform(volume=volume, mask3d=mask)
vol_aug = augmented["volume"]
mask_aug = augmented["mask3d"]
The output will be similar to the following.
Albumentations Pipelines
An Albumentations pipeline is a structured way to apply a sequence of image augmentations using the Albumentations library. The pipeline makes it easy to define and combine multiple image transformations such as flipping, rotating, blurring, brightness adjustment, and more in a single, reusable block. When applied, these transformations help artificially expand the dataset by creating varied versions of the original images. This process, known as data augmentation, is especially important in training deep learning models, as it improves their ability to generalize by exposing them to diverse visual patterns.
Albumentations pipelines are widely used in tasks like image classification, object detection, and segmentation, and support not only images but also associated data like masks, bounding boxes, and keypoints. These pipelines are created using albumentations.Compose(). Here's the syntax:
pipeline = albumentations.Compose(
transforms, # List of augmentations (required)
bbox_params=None, # For bounding boxes (optional)
keypoint_params=None, # For keypoints like facial landmarks (optional)
additional_targets=None, # For handling multiple images/masks (optional)
p=1.0, # Chance to run the whole pipeline (1.0 = always)
is_check_shapes=True, # Checks if image/mask shapes match
strict=False, # Strict validation for unknown inputs
mask_interpolation=None, # How to resize masks (if needed)
seed=None, # Fix randomness (for reproducible results)
save_applied_params=False # Track which transforms were actually used
)
The pipeline has the following parameters:
- transforms: A list of image augmentations like albumentations.Flip(), albumentations.Blur(), etc. This is a required argument.
- p: Defines how likely the entire pipeline is to run. Use 1.0 to always run it. (Not to be confused with the p of individual transforms.)
- bbox_params: Needed if you want to apply transforms to bounding boxes.
- keypoint_params: Use this if you're working with keypoints (e.g., eye/nose positions in face images).
- additional_targets: Helps when you want to augment more than one image or mask at the same time (like 'image2': 'image').
- is_check_shapes: If True, Albumentations checks that all inputs (image, mask, etc.) are the same size. Helps avoid shape mismatch errors.
- strict: If True, Albumentations will raise an error if you pass anything unexpected. Set it to False while debugging.
- mask_interpolation: If you're resizing masks, this controls the method used. Leave as None unless you need precise control.
- seed: Fixes randomness. Use this if you want the same results every time you run the pipeline.
- save_applied_params: If True, you can print out which transforms were actually used during augmentation. Very helpful for debugging.
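As a quick illustration of additional_targets and seed, the sketch below augments two images with identical parameters (image2 is an arbitrary name chosen here; a recent Albumentations version is assumed).
pipeline = A.Compose(
    [A.HorizontalFlip(p=0.5), A.RandomBrightnessContrast(p=0.5)],
    additional_targets={"image2": "image"},  # treat the extra input exactly like an image
    seed=42,  # fix randomness so results are reproducible across runs
)
out = pipeline(image=img, image2=img.copy())
aug1, aug2 = out["image"], out["image2"]  # both receive the same transform parameters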
Albumentations offers different methods to build flexible pipelines depending on how and when you want the transforms to be applied. Here are four common methods used to build and customize augmentation pipelines:
- Basic Pipeline: Applies a list of augmentations one after another in a fixed order.
- Dynamic Modification: Allows you to add or remove transforms from an existing pipeline using + and - operators.
- Composition Utilities: Provides special wrappers like OneOf, SomeOf, Sequential, and RandomOrder to control how multiple transforms are applied.
- Parameter Configuration: Lets you define settings for bounding boxes, keypoints, additional targets, and random seeds to handle complex input data.
Let’s see how to work with each of these methods:
Basic Pipeline
To apply a sequence of augmentations to an image, `albumentations.Compose([transform1, transform2, ...])` is used. Each transformation is applied in the order you define. For example, we define an Albumentations pipeline that applies the following three image augmentations to an image:
- Horizontal Flip: With a 50% chance, the image may be flipped left to right.
- Random Brightness and Contrast: With a 70% chance, the brightness and contrast of the image may be adjusted to make it brighter/darker or increase/reduce contrast.
- Gaussian Blur: With a 30% chance, a blur effect may be added to soften the image details.
Each transformation is applied independently based on its probability. When the pipeline is run, the result is a randomly augmented version of the original image, and the final output is displayed.
# Load image
image = cv2.imread("raccoon.png")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Define pipeline
pipeline = A.Compose([
A.HorizontalFlip(p=0.5), # 50% chance to flip
A.RandomBrightnessContrast(p=0.7), # 70% chance to adjust brightness/contrast
A.GaussianBlur(p=0.3) # 30% chance to apply blur
])
# Apply pipeline
result = pipeline(image=image)
aug_image = result["image"]
You will see the following output.
The pipeline does not always apply all augmentations; each transformation is applied independently based on its own probability (p value).
Here’s what your pipeline means:
pipeline = A.Compose([
A.HorizontalFlip(p=0.5), # 50% chance to flip
A.RandomBrightnessContrast(p=0.7), # 70% chance to adjust brightness/contrast
A.GaussianBlur(p=0.3) # 30% chance to apply blur
])
On each run, the pipeline goes through the transforms in order. For each transform, it checks the probability. If the random check passes (e.g., random number < p), the transform is applied. If it fails, the transform is skipped. So, in a single run:
- Sometimes only one transform might be applied,
- Sometimes two,
- Sometimes all three,
- Or even none (though rare, since most p values are > 0.3).
If you want all transforms to apply every time, you should set p=1.0 for each transform.
You can track and print which transformations were applied by setting save_applied_params=True when creating the pipeline. This will return extra metadata about each applied transform, including the name and parameters used. For example, the following code helps you track the applied augmentations.
# Define pipeline with applied transform tracking
pipeline = A.Compose([
A.HorizontalFlip(p=0.5), # 50% chance to flip
A.RandomBrightnessContrast(p=0.7), # 70% chance to adjust brightness/contrast
A.GaussianBlur(p=0.3) # 30% chance to apply blur
], save_applied_params=True)
# Apply pipeline
result = pipeline(image=image)
aug_image = result["image"]
The output could look like the following.
Here you can see that HorizontalFlip was not applied. You may get a different result each time you run the pipeline.
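To inspect this metadata programmatically, you can read it back from the result dictionary; the key name used below (applied_transforms) is an assumption based on recent Albumentations versions, so check your version if it differs.
for entry in result.get("applied_transforms", []):  # key name assumed; may differ by version
    print(entry)  # each entry records a transform name and the parameters actually used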
Dynamic Modification: Add or Remove Transforms
Once a pipeline is defined, you can easily add or remove transforms without starting over. For example, you can add a transform at the end of the pipeline using the following statement.
pipeline = pipeline + A.Rotate(limit=15, p=0.5) # Add rotate at the end
To add a transform at the beginning, use:
pipeline = A.Resize(256, 256) + pipeline # Resize will now happen first
You can also remove a transform using:
pipeline = pipeline - A.GaussianBlur # Removes the first occurrence of GaussianBlur
This makes it easy to experiment with different combinations of augmentations dynamically.
Composition Utilities: Add Randomness and Control
Albumentations also offers special wrappers to control how transformations are applied:
OneOf: Applies only one transform from the list, randomly selected, each time the pipeline runs. Even if all transforms have p=1.0, only one will be picked and applied.
A.OneOf([
A.GaussianBlur(p=1.0),
A.MotionBlur(p=1.0)
], p=0.9)
This has a 90% chance to apply either GaussianBlur or MotionBlur, but not both.
SomeOf: Applies a few (randomly chosen) transforms from a list, like 1, 2, or 3 out of 4. You can control how many are chosen and the chance of applying each.
A.SomeOf([
A.GaussianBlur(p=0.5),
A.Sharpen(p=0.5),
A.HueSaturationValue(p=0.8)
], n=(1, 2), p=0.9)
This block will:
- Run 90% of the time,
- Randomly pick 1 or 2 transforms from the list,
- And apply only those, based on their own p values.
Sequential: Runs a group of transforms together, either all of them or none, based on one shared probability p.
A.Sequential([
A.HorizontalFlip(p=1.0),
A.RandomBrightnessContrast(p=1.0)
], p=0.4)
Here, the whole block has a 40% chance to run. If it runs, both transforms are applied in order. If not, none are applied.
RandomOrder: Shuffles the order in which the listed augmentations are applied every time the pipeline runs. Instead of applying augmentations in a fixed order (top to bottom), RandomOrder randomly rearranges them on each run.
A.RandomOrder([
A.HorizontalFlip(p=0.5),
A.RandomBrightnessContrast(p=0.8)
])
Here’s how it works:
- Sometimes it applies HorizontalFlip first, then RandomBrightnessContrast,
- Other times RandomBrightnessContrast first, then HorizontalFlip.
The probabilities (p=0.5, p=0.8) still apply individually, but the order changes randomly.
Tips to Speed Up Augmentation Pipelines
Beyond choosing the right composition utilities, a few practical habits make pipelines noticeably faster:
- Crop Early: Place cropping transforms like RandomCrop or RandomResizedCrop at the start of the pipeline to reduce the number of pixels subsequent transforms process; up to a 16× speedup is possible.
- Use uint8 images: Albumentations and OpenCV run faster on uint8 images. Keep float conversion only for the final Normalize.
- Combine transforms: Use A.Affine instead of separate Rotate, Scale, and Translate to minimize overhead.
- Use OpenCV or torchvision.io.decode_image: These are significantly faster than PIL/Pillow for image reading.
- Avoid OpenCV threading conflicts: In PyTorch DataLoader workers, add cv2.setNumThreads(0) to prevent CPU thread contention (see the sketch below).
- Offload normalization to GPU: For large batches, apply Albumentations on CPU and perform torchvision.transforms.Normalize() on GPU for better throughput.
Read more about Optimizing Augmentation Pipelines for Speed.
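As a concrete illustration of the DataLoader tip above, a worker_init_fn can disable OpenCV's internal threading in each worker. This is only a sketch: train_dataset is assumed to be a PyTorch Dataset that applies an Albumentations pipeline inside __getitem__.
import cv2
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # Stop OpenCV from spawning its own threads inside each DataLoader worker
    cv2.setNumThreads(0)

# train_dataset: your Dataset that calls an Albumentations pipeline in __getitem__
loader = DataLoader(train_dataset, batch_size=32, num_workers=4,
                    worker_init_fn=worker_init_fn)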
Albumentations Augmentation Pipeline Examples
Now that we have learned what a pipeline is, let's see examples of using pipelines. We will see how you can augment an image in the context of the following:
- Bounding box augmentation
- Segmentation mask augmentation
- Keypoint augmentation
Let’s explore now.
Augmenting Images for Object Detection
In object detection, the goal is to identify and locate objects within an image using bounding boxes, which are rectangular regions that tightly enclose the objects of interest. Data augmentation helps improve the robustness and generalization of object detection models by introducing variations in the training data. In this section, we’ll demonstrate how to use Albumentations to build augmentation pipelines that transform images and automatically update the associated bounding box coordinates. This ensures spatial consistency while introducing useful variations like cropping, flipping, resizing, and affine transformations.
import cv2
import albumentations as A
from ultralytics import YOLO

# Load the image
img_path = "baseball.png"
img = cv2.imread(img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# Load YOLOv8 model and detect people
model = YOLO("yolov8n.pt")
results = model(img)
# Extract 'person' bounding boxes
class_map = model.names
person_id = [k for k, v in class_map.items() if v.lower() == "person"][0]
CONFIDENCE_THRESHOLD = 0.8
bbox_coords = []
class_labels = []
for box in results[0].boxes:
if int(box.cls) == person_id and float(box.conf) > CONFIDENCE_THRESHOLD:
x1, y1, x2, y2 = map(int, box.xyxy[0])
bbox_coords.append([x1, y1, x2, y2])
class_labels.append("person")
# Albumentations bbox parameters
bbox_params = A.BboxParams(format='pascal_voc', label_fields=['class_labels'])
# Define transforms
transforms = {
"Original": A.Compose([], bbox_params=bbox_params),
"Horizontal Flip": A.Compose([A.HorizontalFlip(p=1)], bbox_params=bbox_params),
"Vertical Flip": A.Compose([A.VerticalFlip(p=1)], bbox_params=bbox_params),
"Crop": A.Compose([A.Crop(x_min=50, y_min=50, x_max=img.shape[1]-50, y_max=img.shape[0]-50, p=1)], bbox_params=bbox_params),
"Resize": A.Compose([A.Resize(height=224, width=224, p=1)], bbox_params=bbox_params),
"Affine": A.Compose([
A.Affine(scale=(0.8, 1.2), rotate=(-30, 30),
shear={"x": (-10, 10), "y": (-5, 5)},
translate_percent={"x": (-0.1, 0.1), "y": (-0.1, 0.1)},
fit_output=True, p=1)], bbox_params=bbox_params)
}
# Draw boxes
def draw_boxes(image, bboxes):
for x1, y1, x2, y2 in bboxes:
cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)), (255, 0, 0), 2)
return image
This code demonstrates how to use a YOLOv8 object detection model in combination with the Albumentations library to apply data augmentation to an image and its associated bounding boxes. First, it loads an image (baseball.png). Then, it runs inference using the pre-trained YOLOv8 model (yolov8n.pt) to detect objects in the image. From the detected results, it filters out only those objects classified as "person" and further removes any detections with a confidence score lower than 0.8. The bounding box coordinates and corresponding class labels are collected for augmentation.
Next, it defines a dictionary of augmentation transformations such as horizontal flip, vertical flip, cropping, resizing, and affine transformations. Each transformation is wrapped in an Albumentations pipeline using bbox_params, which is essential to inform Albumentations about the format of the bounding boxes (pascal_voc) and to ensure that each bounding box stays aligned with its class label during the transformation. The code applies each transformation to the image and bounding boxes, then draws the updated boxes on the augmented images. Finally, it visualizes the original and augmented images side by side using Matplotlib, providing an intuitive comparison of how each augmentation modifies both the image and the object locations.
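The apply-and-visualize step itself (shown in full in the companion notebook) would look roughly like this sketch, reusing the transforms dictionary and draw_boxes helper defined above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (name, t) in zip(axes.ravel(), transforms.items()):
    # Image, boxes, and labels are transformed together so annotations stay aligned
    out = t(image=img, bboxes=bbox_coords, class_labels=class_labels)
    ax.imshow(draw_boxes(out["image"].copy(), out["bboxes"]))
    ax.set_title(name)
    ax.axis("off")
plt.tight_layout()
plt.show()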
The following will be the output of the above code.
Why is bbox_params required?
Albumentations needs to know:
- The format of your bounding boxes (e.g., pascal_voc, yolo, etc.).
- How to transform the boxes when an image is resized, flipped, cropped, or otherwise changed.
- How to match boxes with class labels (via label_fields=["class_labels"]).
Without bbox_params, Albumentations won't know that you're passing bounding boxes and will not transform them, leading to incorrect or unchanged boxes.
Augmenting Images for Semantic Segmentation
In semantic segmentation, the goal is to assign a class label to each pixel in an image, effectively segmenting objects or regions at the pixel level. This is typically done using segmentation masks, where each pixel value represents a class. Data augmentation not only alters the input images but must also apply the exact transformations to their corresponding masks to preserve alignment. In this section, we’ll explore how Albumentations can be used to apply augmentations like resizing, cropping, flipping, and rotation, while ensuring both the image and its mask are transformed consistently for high-quality training data. Here's the code.
img_path = "scene.png"
img = cv2.imread(img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
input_img = cv2.resize(img, (520, 520))
# Transform for DeepLabV3
preprocess = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize((520, 520)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
input_tensor = preprocess(input_img).unsqueeze(0)
# Load DeepLabV3 model
model = models.segmentation.deeplabv3_resnet101(weights='DEFAULT').eval()
# Run inference
with torch.no_grad():
output = model(input_tensor)['out'][0]
segmentation_mask = output.argmax(0).byte().cpu().numpy()
# Albumentations transforms
transform_list = {
"Original": A.Compose([]),
"Horizontal Flip": A.Compose([A.HorizontalFlip(p=1)]),
"Vertical Flip": A.Compose([A.VerticalFlip(p=1)]),
"Crop": A.Compose([A.Crop(x_min=50, y_min=50, x_max=470, y_max=470)]),
"Resize": A.Compose([A.Resize(height=128, width=128)]),
"Affine": A.Compose([
A.Affine(scale=(0.8,1.2), rotate=(-25,25),
translate_percent={"x":(-0.1,0.1), "y":(-0.1,0.1)},
shear={"x":(-10,10), "y":(-10,10)},
fit_output=True, p=1.0)
])
}
This code demonstrates how to perform semantic segmentation using the DeepLabV3 model and apply various data augmentation techniques with Albumentations. First, it loads an image (scene.png) and resizes it to a fixed shape (520×520), then passes it through a pre-trained DeepLabV3 model (with a ResNet-101 backbone) from torchvision. The model outputs a pixel-wise segmentation mask, where each pixel is assigned a class label. Next, the code defines a set of spatial augmentations such as horizontal flip, vertical flip, cropping, resizing, and affine transformations using Albumentations. Each transformation is applied to both the image and its corresponding segmentation mask, ensuring spatial alignment. Finally, the code visualizes the results by displaying the augmented images and their associated segmentation masks in a two-row grid, allowing you to clearly see how each augmentation alters both the input and the predicted label map.
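A minimal application-and-display step for this example might look like the following sketch, reusing transform_list, input_img, and segmentation_mask from above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, len(transform_list), figsize=(18, 6))
for col, (name, t) in enumerate(transform_list.items()):
    # Passing mask= ensures the segmentation labels receive the same spatial transform
    out = t(image=input_img, mask=segmentation_mask)
    axes[0, col].imshow(out["image"])
    axes[0, col].set_title(name)
    axes[0, col].axis("off")
    axes[1, col].imshow(out["mask"])
    axes[1, col].axis("off")
plt.tight_layout()
plt.show()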
Augmenting Images for Pose Estimation
Pose estimation involves identifying and localizing keypoints on objects or human bodies—such as joints, fingertips, or facial landmarks. These keypoints are sensitive to spatial changes, so any transformation applied to the image must also be reflected in the keypoint positions. In this section, we’ll show how to use Albumentations to augment images and accurately update the associated keypoints through transformations like flipping, affine shifts, and scaling. This ensures pose estimation models can learn to handle diverse perspectives and poses.
import cv2
import albumentations as A
from ultralytics import YOLO

# Load image
img = cv2.imread("baseball.png")  # replace with your pose image
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# Load YOLOv8 pose model
model = YOLO("yolov8n-pose.pt")
results = model.predict(img)
# Extract keypoints (list of lists) for the first detected person
# keypoints xy coords: results[0].keypoints.xy (list of shape (num_detections, num_kps, 2))
kps = results[0].keypoints.xy.cpu().tolist()
kps = kps[0] # pick first person
kps_labels = ["kp"] * len(kps)
# Albumentations keypoints handler
kp_params = A.KeypointParams(format='xy', label_fields=['kps_labels'])
# Define augmentations
augments = {
"Original": A.Compose([], keypoint_params=kp_params),
"Horizontal Flip": A.Compose([A.HorizontalFlip(p=1.0)], keypoint_params=kp_params),
"Vertical Flip": A.Compose([A.VerticalFlip(p=1.0)], keypoint_params=kp_params),
"Crop": A.Compose([A.Crop(x_min=50, y_min=50, x_max=img.shape[1]-50, y_max=img.shape[0]-50, p=1.0)], keypoint_params=kp_params),
"Resize": A.Compose([A.Resize(256, 256, p=1.0)], keypoint_params=kp_params),
"Affine": A.Compose([
A.Affine(scale=(0.8,1.2), rotate=(-25,25), translate_percent={"x":(-0.1,0.1),"y":(-0.1,0.1)},
shear={"x":(-10,10),"y":(-10,10)}, fit_output=True, p=1.0)
], keypoint_params=kp_params),
}
# Helper to draw keypoints
def draw_kps(img, keypoints):
out = img.copy()
for x, y in keypoints:
cv2.circle(out, (int(x), int(y)), 4, (255, 0, 0), -1)
return out
This code demonstrates how to combine YOLOv8-pose, a model that detects human body keypoints, with Albumentations to apply data augmentation while preserving the alignment of keypoints. First, it loads an input image (baseball.png) and uses a pre-trained YOLOv8-pose model (yolov8n-pose.pt) to detect people and extract their pose keypoints, the specific (x, y) locations of body joints like shoulders, elbows, and knees.
After detecting keypoints for the first person in the image, the code prepares to augment the image and keypoints together. It sets up the Albumentations pipeline with keypoint_params via the keypoints handler kp_params = A.KeypointParams, telling the library how to handle keypoints during transformations and associating each keypoint with a dummy label. Several augmentation pipelines are defined, including horizontal flip, vertical flip, cropping, resizing, and affine transformations. Each transformation is applied to both the image and the keypoints simultaneously.
The code then visualizes the results by drawing circles at the keypoint positions on each augmented image. All image versions, original and augmented, are displayed side by side using Matplotlib, allowing you to clearly see how each transformation affects the image and how Albumentations adjusts the keypoints to match the new geometry. This ensures that augmented data remains accurate and useful for training pose estimation models.
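The application step for this example (kept in the companion notebook) would be a sketch along these lines, reusing augments, kps, kps_labels, and draw_kps from above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (name, t) in zip(axes.ravel(), augments.items()):
    # Keypoints and their labels are passed alongside the image so they move with it
    out = t(image=img, keypoints=kps, kps_labels=kps_labels)
    ax.imshow(draw_kps(out["image"], out["keypoints"]))
    ax.set_title(name)
    ax.axis("off")
plt.tight_layout()
plt.show()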
Augmenting Video for Object Detection
In object detection, video augmentation involves applying the same transformation consistently across all frames of a video while maintaining synchronization between the bounding boxes and the image frames.
Albumentations supports this through a "flatten targets + frame_indices in label_fields" approach for video annotation. All bounding boxes from all video frames are flattened into a single list, and `frame_indices` in `label_fields` tracks which box belongs to which frame. Albumentations applies the same augmentation consistently across all frames and updates the bounding boxes accordingly. After augmentation, the boxes are regrouped per frame using the frame_indices to save the updated labels.
Here’s the code to augment video using bounding box information.
import os
import cv2
import numpy as np
import albumentations as A
from ultralytics import YOLO
from google.colab import files  # this example assumes it is running in Google Colab

# Upload video file
uploaded = files.upload()
video_path = list(uploaded.keys())[0]
# Setup directories
os.makedirs("frames", exist_ok=True)
os.makedirs("labels", exist_ok=True)
os.makedirs("augmented_frames", exist_ok=True)
os.makedirs("augmented_labels", exist_ok=True)
# Load YOLO model (for detecting "car")
model = YOLO("yolov8n.pt")
class_map = model.names
car_class_ids = [k for k, v in class_map.items() if v.lower() == "car"]
# Read video and run YOLO
cap = cv2.VideoCapture(video_path)
frame_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
video_frames = []
all_bboxes = []
class_labels = []
frame_indices = []
frame_idx = 0
print("Detecting and saving original frames and labels...")
while True:
ret, frame = cap.read()
if not ret:
break
video_frames.append(frame)
results = model(frame)[0]
# Save original frame
cv2.imwrite(f"frames/frame_{frame_idx:05d}.jpg", frame)
# Save YOLO label file for current frame
with open(f"labels/frame_{frame_idx:05d}.txt", "w") as f:
for box in results.boxes:
cls_id = int(box.cls.item())
if cls_id in car_class_ids:
x1, y1, x2, y2 = map(float, box.xyxy[0])
cx = (x1 + x2) / 2 / frame_w
cy = (y1 + y2) / 2 / frame_h
bw = (x2 - x1) / frame_w
bh = (y2 - y1) / frame_h
# Write to original label file
f.write(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}\n")
# Save for augmentation
all_bboxes.append([cx, cy, bw, bh])
class_labels.append(cls_id)
frame_indices.append(frame_idx)
frame_idx += 1
cap.release()
video_array = np.array(video_frames)
print(f"Total frames: {len(video_array)}, Total boxes: {len(all_bboxes)}")
# Define video-consistent Albumentations transform
transform = A.Compose([
A.HorizontalFlip(p=1.0),
A.RandomBrightnessContrast(p=1.0),
], bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels', 'frame_indices']))
# Apply the augmentation consistently across all frames
print("Applying video-consistent augmentation...")
aug = transform(
images=video_array,
bboxes=all_bboxes,
class_labels=class_labels,
frame_indices=frame_indices
)
aug_images = aug['images']
aug_bboxes = np.array(aug['bboxes'])
aug_class_labels = np.array(aug['class_labels'])
aug_frame_indices = np.array(aug['frame_indices'])
print("Augmentation complete.")
# Regroup and save augmented frames and labels
print("Saving augmented frames and labels...")
for i in range(len(aug_images)):
cv2.imwrite(f"augmented_frames/frame_{i:05d}.jpg", aug_images[i])
for idx in range(len(aug_images)):
mask = aug_frame_indices == idx
boxes = aug_bboxes[mask]
labels = aug_class_labels[mask]
with open(f"augmented_labels/frame_{idx:05d}.txt", "w") as f:
for (cx, cy, bw, bh), cls in zip(boxes, labels):
f.write(f"{cls} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}\n")
print("All done! Outputs saved to:")
print("frames/ and labels/ (original)")
print("augmented_frames/ and augmented_labels/ (augmented)")
This code demonstrates how to perform consistent data augmentation on video frames for object detection using a YOLOv8 model and Albumentations. It starts by uploading a video file and extracting its frames. Each frame is saved to a frames/ folder, and the YOLOv8 model detects objects in each frame, specifically looking for the class "car". For every detected car, the bounding box is converted into YOLO format (center x, center y, width, height) and saved as a label file in the labels/ directory. All the frames and associated bounding boxes are stored in arrays for augmentation.
Next, a video augmentation pipeline is defined using A.Compose with transformations like horizontal flipping and brightness/contrast adjustment. These augmentations are applied simultaneously to all frames using images=..., while ensuring the bounding boxes and frame indices are transformed consistently via bbox_params.
Finally, the augmented images are saved to the augmented_frames/ directory, and their corresponding transformed bounding boxes are saved as label files in augmented_labels/. The result is two parallel datasets, one original and one augmented, both ready for training object detection models with spatial and temporal consistency.
To verify that the above augmentation happened correctly, I have written code that creates two output videos by drawing bounding boxes and class names on both the original and augmented video frames. It reads each frame and its corresponding YOLO-format label file from their respective directories (frames/labels and augmented_frames/augmented_labels), draws the bounding boxes using the class id from the label file, and writes the annotated frames into two MP4 videos, original_with_boxes.mp4 and augmented_with_boxes.mp4. This visualization step helps compare how bounding boxes change after augmentation. You can find the code in the accompanying notebook. The output will be similar to the following.
Albumentations Benchmark
This section highlights results from Albumentations' official Image Augmentation Benchmarks, comparing its throughput to other popular libraries, measured as images processed per second per CPU thread (higher is better). For each transform, Albumentations (v2.0.7) demonstrates high throughput, often surpassing alternatives like imgaug, torchvision, Kornia, and AugLy.
Why is Albumentations so fast?
- Optimized backend: Many operations use OpenCV or efficient NumPy, avoiding slow Python loops.
- Single-threaded consistency: Benchmarks were strictly measured with one CPU thread, giving reliable comparisons.
- Good implementation choices: Selection of algorithms and memory management reduce overhead.
Recommendations
- For production pipelines requiring high throughput, Albumentations is an excellent choice.
- It’s especially beneficial when you have lots of color/brightness augmentations where others are slower.
- For GPU-heavy workflows, consider balancing CPU-based aug (Albumentations) with GPU-based batch normalization (e.g., via Kornia or torchvision).
To test Albumentations' speed, I have written code that benchmarks image augmentations across three popular libraries: Torchvision, Albumentations, and Kornia. The benchmark code is available here. The code runs on a high-resolution test image of 3000x2000 pixels, loaded from raccoon3000x2000.jpg. Each augmentation is benchmarked 10 times, and the code outputs:
- Average execution time for each transform per library (Torchvision, Albumentations, Kornia).
- Visual results showing augmented images with time labels.
- A summary report comparing availability, average speed, and performance rankings of each library.
Here’s the output that illustrates that Albumentations is faster than the others.
This benchmarking measures the average execution time of each augmentation over multiple runs to ensure consistent performance evaluation. It helps compare the speed of equivalent transforms across Torchvision, Albumentations, and Kornia.
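If you just want a rough feel for the numbers without the full script, here is a simplified, illustrative timing sketch (not the linked benchmark code) that compares one equivalent transform in Albumentations and Torchvision on a synthetic 3000x2000 image.
import time
import numpy as np
import torch
import albumentations as A
from torchvision import transforms as T

img = np.random.randint(0, 256, (2000, 3000, 3), dtype=np.uint8)  # synthetic stand-in image
tensor = torch.from_numpy(img).permute(2, 0, 1).float()  # CHW float tensor for torchvision

def bench(fn, runs=10):
    # Average wall-clock time per call over several runs
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

alb_blur = A.GaussianBlur(blur_limit=(9, 9), p=1.0)
tv_blur = T.GaussianBlur(kernel_size=9)

print("Albumentations GaussianBlur:", bench(lambda: alb_blur(image=img)["image"]), "s/run")
print("Torchvision GaussianBlur:   ", bench(lambda: tv_blur(tensor)), "s/run")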
Albumentations for Image Augmentation
Albumentations is a fast, flexible, and easy-to-use image augmentation library designed for computer vision tasks. It offers a rich set of transformations, supports keypoint and bounding box augmentation, and integrates easily with popular frameworks like PyTorch. With its performance-optimized design and wide variety of augmentations, Albumentations is an excellent choice for improving model accuracy and robustness in real-world applications.
Is There a Faster Way?
Roboflow offers a powerful SaaS-based data augmentation engine that allows users to apply a wide variety of image and bounding box augmentations without writing any code or manually implementing tools like Albumentations. This is especially useful for users who want to speed up model training and improve dataset diversity using a visual interface.
Learn more about augmenting datasets using Roboflow here: Image Augmentation, What is Data Augmentation? The Ultimate Guide and How to Augment Images for Object Detection.
The following are the types of augmentations supported by Roboflow, categorized as Image-Level and Bounding Box-Level augmentations:
Image-Level Augmentations
These augmentations apply to the entire image, modifying pixel values, orientation, or structure. They are ideal for classification, segmentation, and detection tasks. These are:
- Geometric Transformations such as Flip (Horizontal/vertical flip), 90° Rotate, Crop, Rotation, Shear.
- Color and Light Transformations such as Grayscale, Hue, Saturation, Brightness, Exposure.
- Quality and Noise Augmentations such as Blur, Noise, Cutout, Mosaic.
Bounding Box-Level Augmentations
These augmentations are applied to images with bounding boxes, and Roboflow ensures that the boxes are synchronized with the image so the annotations remain valid.
- Geometric Transformations such as Flip (Horizontal/vertical flip), 90° Rotate, Crop, Rotation, Shear.
- Appearance-Based Transformations such as Brightness, Exposure, Blur, Noise.
Roboflow simplifies and accelerates the image augmentation process through a no-code interface, enabling users to enrich datasets with a wide range of image-level and bounding box-level transformations. By automating augmentation and maintaining annotation integrity, it helps enhance model generalization and reduces the manual workload for computer vision tasks. Get started free today.