Using transfer learning to initialize your computer vision model from pre-trained weights rather than starting from scratch (initializing randomly) has been shown to increase performance and decrease training time. This makes sense: by giving your model prior knowledge about basic concepts like lines, curves, textures, and "things", it should be able to learn more quickly about the specific objects of interest in a custom dataset.

But I've always been curious: does transfer learning always produce better results than a randomly initialized training run? What if the starting checkpoint was trained on an entirely different domain? Is it more important that my starting checkpoint was trained on similar images or that it was trained on a huge number of images?

I decided to run an experiment to find out.

A video summary of this post.

The Task

I decided to use a Mask Wearing dataset as my test subject and observe the results of using different starting checkpoints on model performance.

The final result of the best model.

The Setup

I decided to use YOLOv5 as the model architecture for these tests. It has been shown to generalize well and has support for transfer learning. But most importantly, it's easy to use, so I was able to train all the models I needed for this test in a single day. I followed this tutorial on training YOLOv5 on a custom dataset to train both the starting points and the final models.

I used the YOLOv5s model size with the default settings, including a 640x640 input size and the built-in augmentations. The models were set to train until the loss on the validation set did not improve for 50 consecutive epochs (in practice, this took about 1 hour on a V100 GPU, with little variance between training runs).
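The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for these runs: the `EarlyStopper` class, the `epochs_trained` helper, and the `patience` parameter are hypothetical names, and the sketch assumes one validation loss value is reported per epoch.

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` epochs.

    Hypothetical helper for illustration; it is not part of YOLOv5.
    """

    def __init__(self, patience=50):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        """Record this epoch's validation loss; return True when training should end."""
        if val_loss < self.best_loss:
            # New best loss: reset the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_trained(val_losses, patience=50):
    """Return how many epochs run before the rule fires (all of them if it never does)."""
    stopper = EarlyStopper(patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)
```

With `patience=50`, a run ends 50 epochs after its best validation loss, so the total training time depends on when that best epoch occurs.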

The Transfer Learning Starting Points

I decided to train four models from four different checkpoints on the Mask Wearing dataset and compare their performance. The starting points I chose were:

  • Randomly Initialized Weights - to get a baseline without using transfer learning. This is the default if you don't pass any weights to use as a starting checkpoint.
  • Microsoft COCO - this is the industry standard for transfer learning; it's trained on millions of photos containing a wide variety of common objects. Notably, "face" is not one of the classes in COCO, so while the model has broad prior knowledge of general concepts, it has no knowledge particularly relevant to our mask detection task.
  • WIDER FACE - a dataset of 16,000 images with faces labeled. I specifically chose this dataset because it seemed very similar to the mask wearing task: our model will first need to find faces to determine whether they're wearing masks. I was particularly interested in whether the more specific prior knowledge outweighed the fact that this model has seen 100x fewer images than the COCO-trained model.
  • BCCD - a dataset of blood cell images from a microscope. This dataset was chosen as a particularly perverse example: it is about as far away from face mask detection as possible. By including it, I wanted to see if transfer learning was always helpful (or at the very least neutral) or if it could sometimes degrade performance by locking a model into its prior biases.
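As a rough sketch of how the four runs differ, the snippet below builds one YOLOv5 `train.py` command per starting point. Only the `--weights` value changes between runs. The flags shown (`--img`, `--data`, `--cfg`, `--weights`, `--name`) are standard YOLOv5 training options, and `yolov5s.pt` is the official COCO-pretrained checkpoint. The WIDER FACE and BCCD checkpoint paths, the `data.yaml` file name, and the run names are assumptions made for illustration, not the exact files used in this experiment.

```python
def yolov5_train_command(run_name, weights, data_yaml="data.yaml"):
    """Build an argv list for YOLOv5's train.py for one starting checkpoint.

    An empty `weights` string asks YOLOv5 to initialize randomly from the
    architecture defined in the --cfg file.
    """
    return [
        "python", "train.py",
        "--img", "640",                    # 640x640 input size
        "--data", data_yaml,               # hypothetical path to the Mask Wearing dataset config
        "--cfg", "models/yolov5s.yaml",    # YOLOv5s architecture
        "--weights", weights,              # the only value that differs between runs
        "--name", run_name,
    ]


# Hypothetical checkpoint locations; only "yolov5s.pt" is an official file name.
STARTING_POINTS = {
    "random": "",
    "coco": "yolov5s.pt",
    "wider_face": "checkpoints/wider_face.pt",
    "bccd": "checkpoints/bccd.pt",
}

commands = {
    name: yolov5_train_command(f"mask_from_{name}", weights)
    for name, weights in STARTING_POINTS.items()
}
```

Each list could then be passed to `subprocess.run` from the root of a YOLOv5 checkout. Keeping every other setting fixed means any difference in the final metrics can be attributed to the starting checkpoint.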

The Results

The results on the held-back test set are as follows:

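| Starting Point | Starting mAP | mAP   | Precision | Recall |
|----------------|--------------|-------|-----------|--------|
| Random         | N/A          | 76.9% | 33.1%     | 84.7%  |
| COCO           | 55.8%        | 83.6% | 50.8%     | 90.0%  |
| WIDER Face     | 65.6%        | 87.5% | 64.3%     | 88.3%  |
| BCCD           | 90.9%        | 75.9% | 41.6%     | 83.0%  |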

It can be hard to visualize what this means in practice, so here is an example prediction from each trained model on two images from the testing set (visualized at 50% confidence):

In these examples, you can see that the "from scratch", "bccd", and "coco" starting-point models missed some masks completely (for example, the person in profile facing right in the first photo; the bccd model also missed the person in the second photo entirely), while the WIDER Face starting-point model did better.

Interestingly, there are some examples where the COCO model performed better. Notably, the COCO-based model identifies a mask in this photo, where a man is wearing a mask that covers his entire face, and the WIDER Face-based model does not:


The checkpoint you choose to start from for transfer learning does affect the quality of your final model. Choosing a starting point more closely aligned with your problem produces better results (even if it has learned from fewer example images), and choosing a poor starting point can be worse than using randomly initialized weights (but probably not by much).
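To make the size of these effects concrete, here is a small sketch that computes each starting point's mAP gain over the randomly initialized baseline. The mAP values are copied from the results table; the `map_gain_over_random` function name is just an illustrative choice.

```python
# Final test-set mAP (%) for each starting point, copied from the results table.
FINAL_MAP = {"Random": 76.9, "COCO": 83.6, "WIDER FACE": 87.5, "BCCD": 75.9}


def map_gain_over_random(final_map):
    """Return each checkpoint's mAP difference, in percentage points, vs. the random baseline."""
    baseline = final_map["Random"]
    return {
        name: round(score - baseline, 1)
        for name, score in final_map.items()
        if name != "Random"
    }


gains = map_gain_over_random(FINAL_MAP)
# Print the gains from largest to smallest.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gain:+.1f} points")
```

On these numbers, WIDER FACE gains 10.6 points and COCO gains 6.7 points over random initialization, while BCCD loses 1.0 point.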

Luckily, Roboflow Train makes it easy to experiment with different starting checkpoints for your models! You can use a number of models pre-trained on datasets like COCO or previous versions of your own models as a starting point. Try it out today.