At Roboflow, we often get asked: what is the train, validation, test split, and why do I need it?

The train, validation, test split visualized in Roboflow

The motivation is quite simple: you should separate your data into train, validation, and test splits to prevent your model from overfitting and to accurately evaluate your model. The practice, however, is more nuanced...

Let's dive in!

What is Overfitting in Computer Vision?

When you are training a computer vision model, you are showing it example images to learn from. To guide the model toward convergence, a loss function tells it how close or far it is from making the correct prediction. Guided by that loss, the model learns a prediction function that maps the pixels in an image to an output.

The modeling process in PP-YOLO
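
To make this concrete, here is a minimal sketch of a single training step in PyTorch. The toy model, batch shapes, and hyperparameters below are illustrative placeholders, not anything specific to PP-YOLO or Roboflow: the loss function scores the prediction, and the optimizer uses the loss's gradients to nudge the prediction function closer to the correct output.

```python
import torch
import torch.nn as nn

# Toy stand-in for a vision model: maps image pixels to class scores.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 64, 64)    # a batch of example images
labels = torch.randint(0, 10, (8,))   # their correct outputs

optimizer.zero_grad()
predictions = model(images)
loss = loss_fn(predictions, labels)   # how far off is the model?
loss.backward()                       # gradients of the loss
optimizer.step()                      # update the prediction function
```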

The danger in the training process is that your model may overfit to the training set. That is, the model may learn an overly specific function that performs well on your training data but does not generalize to images it has never seen.

Here's a 2D example:

An example of a model tightly fitting a function based on training data

If your model overfits to the training set during training, you will see loss curves where the training loss keeps decreasing while the validation loss begins to increase.

An example of overfitting during training

This means that your model is memorizing the training set rather than generalizing to the unseen images in the validation set.

The train, validation, and test splits are built to combat overfitting.

The Training Set

The training set is the largest corpus of your dataset, the portion you reserve for training your model. After training, inference on these images should be taken with a grain of salt, since the model has already had a chance to look at them and memorize the correct output.

As a default, we recommend allocating 70% of your dataset to the training set.

The Validation Set

The validation set is a separate section of your dataset that you will use during training to get a sense of how well your model is doing on images that are not being used in training.

During training, it is common to report validation metrics after each training epoch, such as validation mAP or validation loss. You use these metrics to get a sense of when your model has hit the best performance it can reach on your validation set. You may choose to cease training at this point, a process called "early stopping".
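
As a rough sketch, early stopping often looks something like the following pattern. Here `train_one_epoch`, `evaluate`, `save_checkpoint`, and the data loaders are hypothetical placeholders for your own training loop, and the patience value is an arbitrary choice:

```python
best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 5      # tolerate 5 epochs of no validation improvement
max_epochs = 100

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)    # placeholder: one pass over train
    val_loss = evaluate(model, val_loader)  # placeholder: validation pass

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        save_checkpoint(model)  # keep the best weights seen so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
```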

As you work on your model, you can continually iterate on your dataset, image augmentations, and model design to increase your model's performance on the validation set.

We recommend holding out 20% of your dataset for the validation set.

The Test Set

After all of your training experiments have concluded, you have probably gotten a sense of how your model might do on the validation set. But it is important to remember that the validation set metrics may have influenced you during the creation of the model, and in this sense you might, as a designer, have overfit your new model to the validation set.

Because the validation set is heavily used in model creation, it is important to hold back a completely separate holdout of data: the test set. You can run evaluation metrics on the test set at the very end of your project to get a sense of how well your model will do in production.

We recommend an allocation of 10% of your dataset to the test set.
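
If you are splitting a dataset yourself rather than letting Roboflow handle it, a shuffled split along these 70/20/10 lines might look like the following minimal sketch in plain Python (the fractions are just the defaults recommended above):

```python
import random

def train_val_test_split(items, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle and split a list of items (e.g., image paths) 70/20/10."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = train_val_test_split([f"img_{i}.jpg" for i in range(100)])
print(len(train), len(val), len(test))  # 70 20 10
```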

How Train, Validation, and Test Relate to Preprocessing and Augmentation

Naturally, the concept of train, validation, and test influences the way you should process your data as you are getting ready for training and deployment of your computer vision model.

Preprocessing steps are image transformations used to standardize your dataset across all three splits. Examples include statically cropping your images or converting them to grayscale. All preprocessing steps are applied to train, validation, and test alike.
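
For example, with torchvision (one library among many that could fill this role), a preprocessing pipeline you would apply to every split might look like this; the target size is an arbitrary choice for illustration:

```python
from torchvision import transforms

# Preprocessing: applied identically to train, validation, and test.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # grayscale every image
    transforms.Resize((416, 416)),                # standardize dimensions
    transforms.ToTensor(),
])
```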

Image augmentations increase the effective size of your training set by making slight alterations to your training images. Augmentations are applied only to the training set and should not be used during evaluation procedures. For evaluation, you want to use the unaltered ground truth images residing in the validation and test sets.
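
Continuing the torchvision sketch above, the key point is that the training pipeline gets the random alterations while the evaluation pipeline does not (the specific augmentations here are purely illustrative):

```python
from torchvision import transforms

# Training set: preprocessing plus random augmentations.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.Resize((416, 416)),
    transforms.ToTensor(),
])

# Validation and test sets: preprocessing only, no random alterations.
eval_transform = transforms.Compose([
    transforms.Resize((416, 416)),
    transforms.ToTensor(),
])
```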

Common Pitfalls in the Train, Validation, Test Split

Here are some common pitfalls to avoid when separating your images into train, validation, and test.

Train Test Bleed

Train-test bleed is when some of your testing images are overly similar to your training images. For example, if you have duplicate images in your dataset, you want to make sure they do not end up in different train, validation, and test splits, since their presence will bias your evaluation metrics. Thankfully, Roboflow automatically removes duplicates during the upload process, so you can put most of these worries to the side.
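
If you are managing splits outside of Roboflow, a quick sanity check for exact duplicates is to hash file contents across splits. The folder layout below is hypothetical:

```python
import hashlib
from pathlib import Path

def content_hashes(folder):
    """Map each image's content hash to its path (hypothetical layout)."""
    return {hashlib.md5(p.read_bytes()).hexdigest(): p
            for p in Path(folder).glob("*.jpg")}

train_hashes = content_hashes("dataset/train")
test_hashes = content_hashes("dataset/test")

# Any hash appearing in both splits is an exact duplicate: train-test bleed.
for h in set(train_hashes) & set(test_hashes):
    print(f"Duplicate across splits: {train_hashes[h]} and {test_hashes[h]}")
```

Note that this only flags byte-identical files; catching near-duplicates (resized or re-encoded copies) requires something like perceptual hashing or embedding similarity.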

Overemphasis on the Training Set

The more data, the better the model. This mantra might tempt you to use most of your dataset for the training set and to hold out only 10% or so for validation and test. Skimping on your validation and test sets, however, could cloud your evaluation metrics with a limited subsample and lead you to choose a suboptimal model.

Overemphasis on Validation and Test Set Metrics

At the end of the day, the validation and test set metrics are only as good as the data underlying them, and may not be fully representative of how well your model will perform in production. That said, you should use them as a guidepost, pushing your model's performance and robustness ever higher.

Conclusion

In this post, we discussed the train, validation, and test splits and how they help us prevent overfitting and choose the model that will do best in production. We also discussed some common pitfalls in the creation and use of these splits and how you can avoid them.

As always, happy training.