Active Learning: Why Some Data Are More Equal Than Others and How You Can Use it to Your Advantage

Artificial Intelligence is a technology that thrives on two kinds of fuel: computing power and data. Their increasing affordability is the driving force behind the recent AI boom. In fact, many of the key ideas behind artificial neural networks have been around for decades. However, it took the democratization of compute resources and the availability of large training sets to enable the kind of rapid progress that we have been witnessing in deep learning. Data has never been so cheap to produce and store, but one thing remains costly: data annotation.

Data annotation can take many different forms: sorting photos into piles of dogs vs. cats, selecting which passage in a text contains the response to a given question, or identifying every pixel of an ultrasound image that corresponds to a malignant tumor. One attribute that all of these have in common is that they all require an actual human being to carry out the annotation task. The costs of human labor only go up with time (fortunately for the world we live in!), and to complicate matters further, unlike computing infrastructure, humans do not scale particularly well. On the other hand, the general trend in state-of-the-art deep learning is building deeper and larger networks — which require more data than ever to be trained!

If you cannot label all the data that you have at your disposal, the choice of which subset of training instances to annotate may well be of paramount importance (the fewer labeled instances, the more so). Active Learning is an approach where you enlist the help of the model itself to figure out which instances would be most beneficial to have labelled. It relies on the fact that not all data will be equally useful for your training, and that notion is what we are going to explore in today’s post.

Let’s say you’ve got a cat vs. dog classifier to train. Imagine you have no shortage of photos, but you are operating on an extremely tight labelling budget. Your goal is to reach a certain level of performance for your model (say, 90% accuracy on a balanced test set), while labelling as few images as possible. How would you go about it?

Well, you might want to build your model iteratively. First, you label a batch of photos chosen at random from your unlabeled dataset U and add them to your labeled training set L. Then you train the classifier on L and measure its performance on your validation set V. Not good enough? Repeat the steps: move another batch of images from U to L by labelling them, and see how far your model gets this time. Stop once you reach 90% on your validation set, hopefully before you run out of your labelling budget! Speaking of which, here are a few traps to watch out for to make the most of your hard-earned annotation money:
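
Before we get to the traps, here is a rough sketch of the loop just described. It is only an illustration: the function name, its parameters and the toy 1-D "classifier" are all made up for this example, and the oracle function stands in for the human annotator.

```python
import random

def random_labeling_loop(pool, oracle, train, evaluate,
                         batch_size, target, budget, seed=0):
    """Label random batches from U until the target score or the budget is hit.

    pool     -- unlabeled instances (U)
    oracle   -- callable x -> label; stands in for the human annotator
    train    -- callable list[(x, y)] -> model
    evaluate -- callable model -> score on the validation set V
    """
    rng = random.Random(seed)
    unlabeled = list(pool)   # U
    labeled = []             # L
    model, score = None, 0.0
    while unlabeled and len(labeled) < budget:
        n = min(batch_size, budget - len(labeled), len(unlabeled))
        rng.shuffle(unlabeled)                       # random choice of the next batch
        batch, unlabeled = unlabeled[:n], unlabeled[n:]
        labeled += [(x, oracle(x)) for x in batch]   # the costly annotation step
        model = train(labeled)
        score = evaluate(model)
        if score >= target:
            break
    return model, score, len(labeled)

# Toy stand-ins (purely illustrative): 1-D points, true label is x > 0.5.
def oracle(x):
    return int(x > 0.5)

def train(labeled):
    # Threshold halfway between the largest known 0 and the smallest known 1.
    zeros = [x for x, y in labeled if y == 0] or [0.0]
    ones = [x for x, y in labeled if y == 1] or [1.0]
    return (max(zeros) + min(ones)) / 2

validation = [(i / 100, oracle(i / 100)) for i in range(101)]

def evaluate(threshold):
    hits = sum(int(x > threshold) == y for x, y in validation)
    return hits / len(validation)

data_rng = random.Random(42)
pool = [data_rng.random() for _ in range(1000)]
model, score, n_labeled = random_labeling_loop(
    pool, oracle, train, evaluate, batch_size=10, target=0.9, budget=200)
print(f"accuracy {score:.2f} after labelling {n_labeled} points")
```

The loop stops either when the target is met or when the budget is spent, whichever comes first; everything that follows is about choosing the next batch more cleverly than at random.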

Alert # 1: the duplicates

Say, among the photos that you labeled in your first batch, you have this little fellow:

[Image: cute_kittie.jpg, a photo of an orange tabby kitten]

You train the classifier on this batch, and proceed to label the next batch of images for the second iteration. Here is a twist: in real life, you can often find duplicate instances (photos, in this example) inside your dataset. And you might just happen to find two files with the same photo of the orange tabby kitten in your second batch of images to be annotated:

[Images: kitten03.png and 101547.jpg, two files containing the same photo of the orange tabby kitten]

It would do your model absolutely no good to have the duplicate images labeled again (if anything, it would be giving extra weight to certain training instances, which may not necessarily align with your end goal). So at best, you are throwing money away, and at worst, you are hurting your model’s performance while you are at it. How can this be avoided?


Quite simply, actually! Once you have trained your model on the first batch of images, cute_kittie.jpg included, the model would be well-fitted to that batch, outputting a strong cat prediction for our little orange guy. Since kitten03.png and 101547.jpg correspond to the same input tensor, they too will be classified as cat with the same high degree of confidence. All you have to do to avoid re-labelling duplicates in your dataset is take the trained classifier, use it to get predictions on your unlabeled dataset U and exclude all inputs whose confidence scores are close to 1 from the pool of images to be labelled.
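
As a minimal sketch of that filtering step, assuming the classifier outputs one probability per class for every unlabeled image: the function name, the 0.99 cutoff, the probability values and the file IMG_0042.jpg are all invented for illustration.

```python
def select_for_labeling(probs_by_name, threshold=0.99):
    """Keep only inputs whose top class probability is below `threshold`."""
    return [name for name, probs in probs_by_name.items()
            if max(probs) < threshold]

# Made-up (cat, dog) probabilities from the classifier trained on the first batch.
predictions = {
    "kitten03.png": (0.998, 0.002),   # duplicate of cute_kittie.jpg
    "101547.jpg": (0.998, 0.002),     # another duplicate
    "IMG_0042.jpg": (0.610, 0.390),   # hypothetical photo the model is unsure about
}
to_label = select_for_labeling(predictions)
print(to_label)
```

What counts as "close to 1" is a judgment call; the right cutoff will depend on your model and your data.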

Alert # 2: data augmentation got you covered

Consider a similar scenario that also comes up often in real-life problems. cute_kittie.jpg has been labelled, and its duplicates have been removed from the dataset, but now we start getting images like these:

Look vaguely familiar, don’t they? Sure, they are technically different images, but they are all related by image transforms of some sort (e.g. rotation, translation, and/or zoom). We can get these types of variations for free by using data augmentation techniques to artificially increase the size of our training set. Thus, there is no value to be gained by annotating every photo of the orange tabby kitten that is related to the others by a geometric transform. Can we avoid it? Yes, in the exact same way that we got rid of the duplicates! All you have to do is add data augmentation to your data loading pipeline, train the model, and exclude whatever inputs get a confidence score close to 1.
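
To make the idea of augmentation concrete, here is a toy sketch that works on images stored as nested lists of pixel values. In a real PyTorch pipeline you would more likely reach for ready-made transforms (such as those in torchvision.transforms); the helper names below are hypothetical, and the sketch assumes square images and covers only a flip, 90-degree rotations and a horizontal shift.

```python
import random

def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]

def shift_right(img, k, fill=0):
    """Translate the image k pixels to the right, padding with `fill`."""
    return [[fill] * k + row[:len(row) - k] for row in img]

def augment(img, rng, max_shift=1):
    """Return a randomly flipped, rotated and shifted copy of a square image."""
    out = [list(row) for row in img]
    if rng.random() < 0.5:
        out = hflip(out)
    for _ in range(rng.randrange(4)):        # 0 to 3 quarter turns
        out = rot90(out)
    return shift_right(out, rng.randint(0, max_shift))

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rng = random.Random(0)
variants = [augment(image, rng) for _ in range(4)]
for v in variants:
    print(v)
```

Applying a function like this every time an image is loaded means the model sees a slightly different version of each labeled photo in every epoch, at no extra annotation cost.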

Alert # 3: but we already learned this!

You carry on your training and annotating, happy that you only end up labelling images that are unique in the set. Let us say that you have already labeled hundreds of various orange tabby cats and kittens of all shapes and sizes, and still have more to go:


Should you label these? Sure, you could: it would not hurt, and in fact it can only help. But are there better ways to spend what is left of your labelling budget? If by now your model has a pretty good idea of what constitutes an orange tabby feline, it may be more helpful to provide it with labeled photos of black and white Persians and German shepherd puppies, or whichever category of cats or dogs your model has not seen enough of during training. How do we avoid labelling images that the model feels confident about? That’s right, by excluding those that get high confidence scores yet again.

So what should we label first?

So far we have seen that labelling certain instances is not of much use to us. We can exclude them and choose others at random out of what remains, but we can also do better than that. For instance, instead of excluding instances with high confidence scores, we can prioritize those with low scores. Since the model learns from new data at every iteration, the scores will be updated to reflect what the model has learned so far. You can look at this as the model actively querying you for certain labels, which is where the method’s name comes from 😉

In our case of binary cat vs. dog classification, a low top confidence score (that is, a highest class probability close to 0.5) basically means that the model cannot tell whether it has been presented with a picture of a cat or a picture of a dog.

Can you blame it?

Querying the annotator to label photos like this focuses on the decision boundary between the two classes. In turn, this leads to better performance of the model over the next iterations.
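
Prioritizing the inputs with the lowest top confidence score is often referred to as least-confidence (or uncertainty) sampling. A minimal sketch follows; the function name, the file names and the probabilities are all made up for illustration.

```python
def least_confident(probs_by_name, k):
    """Return the k inputs with the lowest top class probability."""
    ranked = sorted(probs_by_name, key=lambda name: max(probs_by_name[name]))
    return ranked[:k]

# Made-up (cat, dog) probabilities for four unlabeled photos.
predictions = {
    "a.jpg": (0.97, 0.03),
    "b.jpg": (0.52, 0.48),
    "c.jpg": (0.80, 0.20),
    "d.jpg": (0.45, 0.55),
}
queries = least_confident(predictions, k=2)
print(queries)
```

Here b.jpg and d.jpg would be sent to the annotator first, since their top probabilities (0.52 and 0.55) sit closest to the 0.5 mark where the model is effectively guessing.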

This sounds like a great way to both save on data annotation costs and get better final accuracy for the classifier. However, as is often the case in the world of machine learning, reality turns out to be a little more complicated than that. In addition to duplicates and multiple versions of the same instances, most real-world datasets also contain plenty of noise. For the cat vs. dog classification example, noise may mean photos that contain neither of the two, blurry photos, or just plain noisy images where we cannot make out the contents.

Images like these would likely be assigned low confidence scores by our classifier-in-training, but they are not the best candidates for labelling.

Thus, in reality, Active Learning is a little more complicated than simply picking out the instances that the model is the least sure about. To find out more about the different Active Learning strategies, take a look at this blog post, and to go in depth into the theory and the PyTorch-filled practice of this approach, come to my ODSC Europe 2020 tutorial Active Learning with a Sprinkle of PyTorch at 11:30 AM BST (GMT +1) on September 18th. See you there!

Image credits:

matthewlarkin.info, alexandrews.picfair.com, instagram.com/gnipoolator, instagram.com/atchoumfan, instagram.com/serjosoza

About the author/ODSC Europe speaker: Olga Petrova is a deep learning R&D engineer at Scaleway, the second-largest French cloud provider. Previously, she received her PhD in theoretical physics from Johns Hopkins University, and spent several years working as a quantum physicist. Olga’s current interests focus on semi-supervised and active machine learning. On the community side, she enjoys blogging about AI both in and out of working hours. Some of Olga’s writing, including a regular newsletter about the latest advancements in the field of active learning, can be seen on medium.com/@olgapetrova_92798. You can also follow her work on LinkedIn.


ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.