First, some facts. Fact: active learning is not just another name for reinforcement learning; active learning is not a model; and no, active learning is not deep learning.
What active learning is and why it may be an important component of your next machine learning project was the subject of Jennifer Prendki’s presentation at ODSC London 2018. Prendki is VP of Machine Learning at Figure Eight, previously Crowdflower, a human-in-the-loop machine learning company that’s on a mission “to empower data scientists to train, test, and tune machine learning for a human world”. In other words, they employ an army of mechanical turkers that label data by hand, a costly endeavor from which they — and others with vast stores of unlabeled data — want to get the most bang for their buck, which they do through active learning.
Before getting into the specifics, imagine how prohibitively expensive it would be to hand-classify every Netflix video into genres. Nobody wants to binge that much, and because it would be inefficient for Netflix to pay a human army for classification at that scale, they need to build a classifier that does it automatically. But if beaucoup bucks are spent on a classifier eventually responsible for automatically classifying every video, they’d want to ensure it’s worth its salt. In other words, the classifier should classify videos well in the real world, even when a given movie is particularly difficult to place into a genre. Identifying the movies most likely to trip up the model, and learning from them before it trips, is active learning.
Active learning is the process by which your model chooses the training data it will learn the most from, with the idea being that your model will predict better on your test set with less data if it’s encouraged to pick the samples it wants to learn from.
In general, it works like this:
Train and retrain your model in a series of loops. Before you begin looping:
- Train a classifier (logistic regression, random forest, SVM, elastic net, etc.) on a random selection of samples you’ve already hand labeled.
- Predict labels for the remainder of your data using this naive model.
In loop 1:
- Identify the samples the model was most uncertain about.
- Get those samples labeled by a human, dubbed the oracle, a.k.a. Mechanical Turkers.
- Retrain your classifier using the initial random samples, plus the samples just labeled by those handy Turkers.
- Predict new labels for the remainder of your data using this now slightly better model.
In loop 2:
- Repeat the process of loop 1
In theory, you continue looping in this way until your model clears a threshold, on a performance metric of your choosing, that you set a priori. In practice, however, knowing when to stop is a little trickier, because there isn’t a documented theoretical framework to inform the number of loops, suggesting that the optimal strategy is application-dependent.
Here’s that in a reductive piece of clipart.
Image Credit: Settles, 2009
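For readers who prefer code to clipart, the loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method from the talk: `CentroidClassifier` is a made-up toy stand-in so the example runs without any libraries, the `oracle` is simulated by revealing labels that were held back, and the names `active_learning_loop`, `batch_size`, and `target_accuracy` are hypothetical. In a real project you would presumably plug in one of the classifiers listed above and a human labeling service.

```python
import math
import random


class CentroidClassifier:
    """Toy stand-in for a real classifier (hypothetical, illustration only)."""

    def fit(self, X, y):
        # One centroid per class: the column-wise mean of that class's rows.
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, X):
        # P(class 1) rises as a row sits closer to centroid 1 than to centroid 0.
        return [1.0 / (1.0 + math.exp(math.dist(x, self.centroids[1])
                                      - math.dist(x, self.centroids[0])))
                for x in X]


def accuracy(model, X, y):
    proba = model.predict_proba(X)
    return sum((p >= 0.5) == bool(t) for p, t in zip(proba, y)) / len(y)


def active_learning_loop(model, X_pool, oracle, X_test, y_test, n_initial=20,
                         batch_size=10, target_accuracy=0.95, max_loops=20, seed=0):
    rng = random.Random(seed)
    # Before looping: hand-label a random selection of samples.
    # (Assumes this random draw contains both classes.)
    labels = {i: oracle(i) for i in rng.sample(range(len(X_pool)), n_initial)}
    history = []
    for _ in range(max_loops):
        idx = sorted(labels)
        model.fit([X_pool[i] for i in idx], [labels[i] for i in idx])
        score = accuracy(model, X_test, y_test)
        history.append((len(labels), score))
        unlabeled = [i for i in range(len(X_pool)) if i not in labels]
        # Stop once the threshold chosen up front is reached (or the pool is empty).
        if score >= target_accuracy or not unlabeled:
            break
        # Predict on the remainder; uncertainty peaks where P(class 1) is near 0.5.
        proba = model.predict_proba([X_pool[i] for i in unlabeled])
        ranked = sorted(zip(unlabeled, proba), key=lambda pair: abs(pair[1] - 0.5))
        # Query the oracle for the most uncertain rows, then retrain on the next pass.
        for i, _p in ranked[:batch_size]:
            labels[i] = oracle(i)
    return model, history


# Synthetic two-class data: two overlapping Gaussian blobs (made up for the demo).
data_rng = random.Random(42)
y_all = [i % 2 for i in range(600)]
X_all = [[data_rng.gauss(1.0 if t else -1.0, 1.0) for _ in range(2)] for t in y_all]
X_pool, y_pool = X_all[:400], y_all[:400]
X_test, y_test = X_all[400:], y_all[400:]

model, history = active_learning_loop(
    CentroidClassifier(), X_pool, lambda i: y_pool[i], X_test, y_test)
for n_labeled, acc in history:
    print(n_labeled, round(acc, 3))
```

The `max_loops` cap is there because, as noted above, the accuracy threshold may never be reached; each row of `history` records how many labels had been paid for and the test accuracy at that point.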
This is how it works, in general. But there are at least three different frameworks to actively learn in, and another three strategies on how to select samples for labeling, also called querying the oracle. They basically differ in how many samples are queried at a time and how those samples are selected.
One can deploy active learning in a few frameworks. You could:
- Query — or ask the oracle — in batches (pools) of samples. This is the pooling framework.
- Query one sample at a time. This is the streaming framework, getting its name due to how samples are sent to the oracle in streams, piecemeal.
- Query samples that are not in your dataset but are believed to be problem points in your feature space. These are made up. This is called synthetic querying. Prendki doesn’t cover this, so I won’t either.
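To make the first two frameworks concrete, here is a small sketch of how each might pick rows, given the model’s predicted probabilities for a binary problem. The function names `pool_query` and `stream_query`, the `band` parameter, and the example probabilities are all assumptions made up for illustration; the talk does not prescribe an implementation.

```python
def pool_query(probabilities, batch_size):
    """Pool framework: score every unlabeled row, then send the
    batch_size most uncertain rows to the oracle in one batch."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]


def stream_query(probabilities, band=0.1):
    """Stream framework: look at rows one at a time and query each row
    whose predicted probability falls within `band` of 0.5."""
    for i, p in enumerate(probabilities):
        if abs(p - 0.5) <= band:
            yield i


# Hypothetical predicted probabilities for eight unlabeled rows.
probs = [0.95, 0.52, 0.10, 0.45, 0.70, 0.58, 0.03, 0.49]

pool_picks = pool_query(probs, batch_size=3)        # always exactly 3 rows
stream_picks = list(stream_query(probs, band=0.1))  # count depends on the data
print(pool_picks, stream_picks)
```

With the pool version you choose how many rows get labeled; with the stream version you choose a threshold and find out the count afterwards.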
The bread and butter of active learning is how you’ll select samples within each framework. Options are the following strategies:
- Query samples the model is most uncertain about, where uncertainty is defined by confidence, entropy, or margin. This is called uncertainty sampling. For binary classification, all three of these measures reduce to the same thing: selecting the N rows whose predicted probability is closest to 0.5. Multiclass classification is a whole other ballgame though.
- Query samples with the most disagreement across multiple models. This is called query by committee.
- Query samples that would change the model the most if we knew their labels, where change can be measured by a reduction in error. This is called expected model change. Prendki doesn’t get into the details, so neither will I.
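The three uncertainty measures from the first strategy are short enough to write out. The sketch below uses their common textbook definitions (least confidence, margin between the top two classes, and Shannon entropy); the example probability vectors are made up. It shows the three measures ranking binary predictions identically, then disagreeing on a three-class case.

```python
import math


def least_confidence(probs):
    # High value = uncertain: the top class has little probability mass.
    return 1.0 - max(probs)


def margin(probs):
    # Low value = uncertain: the top two classes are nearly tied.
    top, runner_up = sorted(probs, reverse=True)[:2]
    return top - runner_up


def entropy(probs):
    # High value = uncertain: probability is spread across classes.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Binary case: all three measures rank the rows the same way.
binary = {"a": [0.9, 0.1], "b": [0.55, 0.45], "c": [0.3, 0.7]}
rank_lc = sorted(binary, key=lambda k: least_confidence(binary[k]), reverse=True)
rank_margin = sorted(binary, key=lambda k: margin(binary[k]))
rank_entropy = sorted(binary, key=lambda k: entropy(binary[k]), reverse=True)
print(rank_lc, rank_margin, rank_entropy)

# Three-class case: the measures can disagree about which row is "most uncertain".
x = [0.5, 0.5, 0.0]   # two classes tied, third ruled out
y = [0.4, 0.3, 0.3]   # no clear favorite at all
print(least_confidence(x) < least_confidence(y))  # least confidence picks y
print(margin(x) < margin(y))                      # margin picks x
print(entropy(x) < entropy(y))                    # entropy picks y
```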
As Prendki points out, the world is the modeler’s oyster, and the strategy you decide to enact can be whatever you want. Say you want to actively learn based on confidence and, to avoid polluting your model with the gunk at the bottom of your distributional barrel, you split your rows into tertiles (or any p-tile) by confidence. Then grab 90% of each loop’s rows from the bottom third, but the remaining 10% from within 20% of the median band of the top third. That’s confusing in English, but you could do it, because the point is: you can experiment with different query-selection strategies.
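Since that strategy is easier to read in code than in English, here is one possible reading of it. This is an interpretation, not Prendki’s implementation: it assumes rows are ranked by model confidence, that "bottom third" means the least confident third, and that "within 20% of the median band" means the central 20% of the most confident third. The function name `tertile_query` and its parameters are hypothetical.

```python
def tertile_query(confidences, batch_size, low_share=0.9, band=0.2):
    """Pick most of the batch from the least confident third, and the rest
    from the middle of the most confident third."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    third = len(order) // 3
    bottom, top = order[:third], order[-third:]

    n_low = round(batch_size * low_share)   # e.g. 90% of the batch
    n_high = batch_size - n_low             # the remaining 10%

    # Central slice of the top third, `band` wide, centered on its median.
    mid = len(top) // 2
    half_band = max(1, round(len(top) * band / 2))
    middle = top[max(0, mid - half_band): mid + half_band]

    return bottom[:n_low] + middle[:n_high]


# Hypothetical confidences for 30 unlabeled rows, rising from 0.5 to 1.0.
confidences = [0.5 + 0.5 * i / 29 for i in range(30)]
picks = tertile_query(confidences, batch_size=10)
print(picks)
```

Changing `low_share` or `band` gives a different query-selection strategy to compare against plain uncertainty sampling.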
So What Should I Try?
Oh come on. You know it depends. Pooling with confidence-based uncertainty sampling is probably the most well-researched approach to active learning in the literature, but as Prendki points out, that’s not because it’s intrinsically the superior approach. What you decide on, according to Prendki, depends on your application and budget.
When you pool, you know exactly how many samples you’re selecting for labeling by the oracles who demand payment for their oracling, giving you precise control over your budget. However, it’s computationally more expensive because you’re retraining on the entire dataset in each loop.
When you stream samples over to your oracle, you reduce computational cost, but budgeting in this environment becomes nebulous as the number of rows you end up sampling is unknown until you finish training. It also doesn’t look at all the data. Passive active sampling, as it were. Plus it’s difficult to set the threshold, unless you have some credible a priori justification, which I find I am typically short of.
Active learning can also help in situations like these:
- You want to know roughly how many labeled data points it takes to train the most accurate classifier you can, so that you label as few as possible.
- Your data is labeled, but you need to model on a sample of it to reduce computation time. You could choose your best sample via an active learning strategy and just pretend your data is unlabeled and being labeled in each loop.
- You have imbalanced classes in your target variable and want to identify which samples to remove when you downsample. Downsampling is out of the scope of this post, but it’s basically removing samples that are over-represented in your target feature until its classes are approximately equally distributed. Typical passive learning will randomly remove samples, but you could use active learning to retain the samples most helpful to your model.
- The above situations applied to a regression problem. This wasn’t discussed in Prendki’s talk, but know you can use this framework for regression problems.
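The downsampling case can be sketched briefly. This is an illustrative assumption about how one might do it, not something shown in the talk: given predicted probabilities from a model trained on some initial sample, keep every minority-class row and only the majority-class rows the model finds hardest (predicted probability nearest 0.5). The function name `informed_downsample` and the example data are made up.

```python
def informed_downsample(labels, probabilities, majority=0):
    """Keep all minority rows plus an equal number of the majority rows
    whose predicted probability is closest to 0.5."""
    minority_idx = [i for i, t in enumerate(labels) if t != majority]
    majority_idx = [i for i, t in enumerate(labels) if t == majority]
    # Hardest majority rows first.
    majority_idx.sort(key=lambda i: abs(probabilities[i] - 0.5))
    keep = minority_idx + majority_idx[:len(minority_idx)]
    return sorted(keep)


# Hypothetical imbalanced data: six rows of class 0, two rows of class 1.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
probabilities = [0.05, 0.40, 0.10, 0.55, 0.02, 0.20, 0.80, 0.60]

kept = informed_downsample(labels, probabilities)
print(kept)
```

Random downsampling would keep any two of the six majority rows; this version keeps the two nearest the decision boundary. Whether that actually helps is something to check on a held-out set, since keeping only hard examples can distort the class distribution the model sees.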
Where Can I Learn More?
- For an overview of the literature on active learning, see Settles, 2009.
- For shorter descriptions, see DataCamp and Quora.
While writing this post, I was disappointed that I couldn’t find an example of active learning in the wild, written in R, so stay tuned for a lightweight tutorial on how to build active learning into a classification model to improve accuracy.