Train Test Split Guide and Overview
To ensure our models generalize well rather than memorize their training data, it is best practice to create a train/test split: we hold out a portion of the data that the model never sees during training and evaluate on it. Without that rigor, our models can easily overfit to the small subset of examples we've collected. Look no further than Tesla using computer vision to identify stop signs: real-world stop signs vary significantly more than one would anticipate.
By default, Roboflow prompts users to create train, valid, and test splits at the time of upload to encourage model-building best practices. The default settings divide a user's data 70 / 20 / 10: 70 percent of the examples go in the training set, 20 percent go in the validation set, and 10 percent are held out in the testing set.
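To make the 70 / 20 / 10 arithmetic concrete, here is a minimal sketch of the same kind of split done by hand in Python. It illustrates the idea only and is not Roboflow's implementation; the function name `split_dataset`, the fixed seed, and the placeholder file names are all assumptions made for this example.

```python
import random


def split_dataset(items, train_frac=0.7, valid_frac=0.2, seed=0):
    """Shuffle items and cut them into train / valid / test lists.

    Whatever is left after the train and valid portions becomes the test set.
    """
    shuffled = list(items)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical file names standing in for an uploaded dataset.
image_names = [f"image_{i:03d}.jpg" for i in range(100)]
train, valid, test = split_dataset(image_names)
print(len(train), len(valid), len(test))
```

Each image lands in exactly one of the three lists, which is the property that matters: an image that appears in both the training and testing sets would make test results look better than they are.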
However, there may be times when you want greater control over exactly which images are in your training, validation, or testing set. In fact, Andrej Karpathy of Tesla spends as much time on test set curation as on training set curation.
Adjusting splits in Roboflow is simple. When uploading data, a user can select whether the images in the current upload should go in the training, validation, or testing set.
Once we've added images to one split in our dataset, we can select "Add More Images" to repeat the upload process, this time selecting "Validation" or "Testing" for our next batch of uploaded images.
As a bonus, if your images happen to be organized in Train, Valid, and Test folders locally and you drop these folders into Roboflow, Roboflow will automatically detect this folder structure at the time of upload.
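If your images are not yet arranged this way, a short script can build the folder layout before upload. The sketch below is one possible approach, not a Roboflow tool: the helper `organize_into_split_folders` is hypothetical, and the lowercase folder names `train`, `valid`, and `test` are an assumption, so check the Roboflow documentation for the exact names it recognizes. The demo runs in a temporary directory with empty placeholder files so that it touches no real data.

```python
import shutil
import tempfile
from pathlib import Path


def organize_into_split_folders(assignments, output_dir):
    """Copy each image into output_dir/<split>/ for every split in assignments.

    assignments maps a split name (e.g. "train") to a list of image paths.
    Files that share a name within one split would overwrite each other.
    """
    output_dir = Path(output_dir)
    for split, paths in assignments.items():
        split_dir = output_dir / split
        split_dir.mkdir(parents=True, exist_ok=True)
        for path in paths:
            shutil.copy2(path, split_dir / Path(path).name)
    return output_dir


# Demo: create ten empty placeholder "images" in a temporary workspace.
workspace = Path(tempfile.mkdtemp())
source_dir = workspace / "raw"
source_dir.mkdir()
demo_images = []
for i in range(10):
    image_path = source_dir / f"image_{i}.jpg"
    image_path.write_bytes(b"")  # placeholder content, not a real image
    demo_images.append(image_path)

# Hand-picked assignment: 7 training, 2 validation, 1 testing image.
assignments = {
    "train": demo_images[:7],
    "valid": demo_images[7:9],
    "test": demo_images[9:],
}
dataset_dir = organize_into_split_folders(assignments, workspace / "dataset")

for split in ("train", "valid", "test"):
    print(split, sorted(f.name for f in (dataset_dir / split).iterdir()))
```

Because the assignment is an explicit mapping rather than a random shuffle, this is also a convenient place to curate the test set by hand, choosing exactly which images the model will be evaluated on.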
Be sure to refer to the Roboflow documentation for additional tips!