Machine learning algorithms are exceptionally data-hungry, requiring thousands – if not millions – of examples to make informed decisions. Providing high quality training data for our algorithms to learn is an expensive task. Active learning is one strategy that optimizes the human labor required to build effective machine learning systems.

Active Learning Defined (Simply)

Active learning is a machine learning training strategy that enables an algorithm to proactively identify subsets of training data that may most efficiently improve performance. More simply – active learning is a strategy for identifying which specific examples in our training data can best improve model performance. It's a subset human-in-the-loop machine learning strategies for improving our models by getting the most from our training data.

An Example of Active Learning in Practice

Imagine you are building a computer vision model to identify packages left on your doorstep in order to send you a push alert that you have received mail. Packages come in a wide array of shapes, sizes, and colors. Say your training data contains 10,000 example images: 600 images of packages that are brown cardboard boxes, 300 images of flat and puffy white envelopes, and 100 images of yellow boxes.

Moreover, each of these boxes could be left on your porch in a different spot, the weather could vary the brightness and darkness on a given day, and the boxes themselves can vary in size. An ideal dataset will have a healthy diversity of variation to capture all possible circumstances: a small yellow box on the left side of your porch during the day to a white envelope on the the right side on a cloudy day (and everything in between, in theory).

We'll say you want to develop a model quickly yet accurately, so you err on the side of labeling a 1,000 images subset of your full 10,000 images, assessing model performance, and then labeling additional data. (Did you know that you can label images in Roboflow?)

Which 1,000 images should you label first?

Intuitively, you'd opt to label some number of each of our hypothetical package classes: brown boxes, white envelopes, and yellow boxes. Because our model cannot learn what, say, a brown box is without seeing any examples of it, we need to be sure to include some number of brown boxes in our training data. Our model would go from zero percent mean average precision on the box class to some non-zero mAP on the brown box class once having examples.

The intuitive decision to ensure our first 1,000 images include examples from each class we want our model to learn is, in effect, one type of active learning. We've narrowed our training data to a set of images that will best improve model performance.

This same concept extends to other attributes of the boxes in our image dataset. What if we selected 300 brown box examples, but every brown box example was a perfect cube? What if we selected all 100 yellow box examples, but our yellow boxes were always delivered on sunny, bright days? Again, our goal should be to create a dataset that includes variation in each of our classes so our model can best learn about packages delivered of any size, in any position, and in any weather condition.

Types of Active Learning

There are multiple methods by which we could aid our model's learning from our initial training data. These methods may include generating additional examples from our training data or determining which subset of examples are most useful to our model. At our core, we're aiming to identify which samples (subset) from a full population (our training data) can best aid model performance.

Sampling which examples are most helpful to our model can follow a few strategies: pool-based sampling, stream-based selective sampling, and membership query synthesis.

All active learning techniques rely on us leveraging some number of examples with ground truth, accurate labels. That is, we inevitably need to human label a number of examples. The active learning techniques differ in how we use these known, accurate examples to identify additional unknown, useful examples from our dataset.

To evaluate these different active learning techniques, pretend we have some "active learning budget" we can spend on paying to perfectly label examples from our package dataset. Now, given all active learning techniques require we already have some number of examples with accurate labels, let's pretend we were given 1,000 perfectly labeled examples to start. We'll also pretend we have 1,000 "active learning credits" in our budget to spend on getting the perfect labels for the an additional 1,000 unlabeled package images.

Bounding boxes on packages labeled for package detection and computer vision
An example of the "free" labeled packages we have available. (Source: Packages Dataset)

Thus, the question is: how should we spend our budget to identify the next most helpful examples to improve our model's performance.

Let's break each active learning technique down, and how they may spend our budget on our theoretical package example.

Pool-Based Sampling

Pool-based sampling is an active learning technique wherein we identify the "information usefulness" of all given examples, and then select the top N examples for training our model. Said another way, we want to identify which examples are most helpful to our model and then include the best ones.

Pool-based sampling is perhaps the most common technique in active learning, though it is fairly memory intensive.

To relate this to how we spend our "active learning cash budget," consider the following strategy. Remember, we already have an allotment of 1000 perfectly human labeled package examples. We could train a model on 800 labeled examples and validate on the remaining 200 labeled examples. With this model, we could assume that the model's lowest predicted precision examples are the ones that are going to be most helpful to improving performance. That is, the places where our model was most wrong in the validation set may be images that are most useful for future training.

So how do we spend our pretend 1,000 active learning credits with pool-based sample? We could run our model on the remaining 9,000 package images. We'll then rank our images by lowest predicted label probabilities: images where no packages were found (0 percent precision) or packages were found with low confidence (say, sub 50 percent precision). We'll spend our 1,000 active learning credits on the 1,000 lowest ranking examples to magically have perfect labels for them. (We may also want to evaluate our mAP per class to select examples weighted towards lower performing package classes.)

Voila! We've optimized our active learning budget to work on the images that will best improve our model.

Stream-Based Selective Sampling

Stream-based selective sampling is an active learning technique where, as the model is training, the active learning system determines whether to query for the perfect ground truth label or assign the model-predicted label based on some threshold. More simply, for each unlabeled example, the active learning agent says, "Do I feel confident enough to assign the model to this myself, or should I ask for the answer key for this example?"

Stream-based selective sampling can be exhaustive search – as each unlabeled example is examined one-by-one – but it can cause us to blow through our active learning budget depending on if the model queries for the true ground truth label too many times.

Let's consider what this looks like in practice for our packages dataset and 1,000 credit active learning budget. Again, remember we start with a model that we've trained on our first 1,000 perfectly labeled examples. Stream-based active learning would then go through the remaining 9,000 examples in our dataset one-by-one and, based on a confidence threshold of predicted labels, the active learning system would decide whether or not to spend one of our active learning credits for a perfectly labeled example for that given image.

But wait a minute: what if our active learner requests labels for more than 1,000 examples? This is exactly the disadvantage of stream-based selective sampling: we may go beyond our budget if our model determines that more than 1,000 examples require a true ground truth label (relative to our confidence threshold). Alternatively, if our budget is truly capped at 1,000, our stream-based sample may not make its way through the full remaining 9,000 examples – only some random first N that stayed within our budget. (This is undesired as we have not optimized the spend of our active learning budget against unlabeled examples.)

Membership Query Synthesis

Membership query synthesis is an active learning technique wherein our active learning agent is able to create its own examples based on our training examples for maximally effective learning. In a sense, the active learning agent determines it may be most useful to, say, create a subset of an image in our labeled training data, and use that newly created subset image for additional training.

Membership query synthesis can be highly effective when our starting training dataset is particularly small. However, it may not always be feasible to generate more examples. Fortunately, in computer vision, this is where data augmentation can make a huge difference.

For example, pretend our package dataset has a shortage of yellow boxes delivered on dark and cloudy days. We could use brightness augmentation to simulate images in lower light conditions. Or, alternatively, imagine many of our brown box images are always on the left side of the porch and close to our camera. We could simulate zooming out or randomly crop the image to alter perspective in our training data.

We could relate this to spending our 1,000 active learning credits by saying we spend a credit every time we generate a new image. Notably, we would need an intelligent way to determine which types of images should be generated. This is where examining a Dataset Health Check of our data would go a long way: we may want to over index on generating yellow box images as, in our example, we only had 100 of them to start. Moreover, knowing the general position of our bounding boxes in the images would help us realize if our delivered packages skew towards one part of our porch.

Applying Active Learning on Your Datasets

Fortunately, many of the active learning techniques reviewed here are not mutually exclusive. For example, we can both create a "information usefulness" score of all of our images as well as generate additional images that best assist in model performance.

As is the case in most machine learning problems, having a clear understanding of your data and being able to rapidly prototype a new model based on added ground truth labels is essential. Roboflow is purpose-built for understanding your computer vision dataset (class balance, annotation counts, bounding box positions), conducting image augmentation, rapidly training computer vision models, and evaluating mean average precision on validation vs test sets as well as by class. Roboflow even makes it easy to continue to collect images from your deployed production conditions with an image upload API so you can, e.g., programmatically add new examples from real inference conditions.

As always, happy building!