Crash Course: Pool-Based Sampling in Active Learning

Active learning is a class of machine learning problems where labeled data is scarce or expensive to obtain, so the learner chooses which examples it wants labeled instead of receiving labels for everything up front.

Let’s take the classic setup as an example. Say we have pictures of birds and want to classify them by type, but the images don’t have labels for what kind of bird they each contain. In this situation, we rely on humans to annotate the information and tell the machine the type of each example it selects.

For now, we’ll focus on one of the most common active learning problems in the field: pool-based sampling.

What is Pool-Based Sampling?

In every active learning problem, the machine has access to some unlabeled examples and queries an “oracle” for labels on the ones it selects. (Oracle is a common term for the entity supplying the labels, typically a human user.) The user then updates the parameters and homes in on a good configuration based on the machine’s predictions for these labeled examples.

Pool-based sampling is one of many variations on this theme, but also one of the most promising. In pool-based sampling, the machine has access to a large pool of unlabeled examples and samples from it based on “informativeness.” Informativeness is quantified with a metric that users choose based on the requirements of their application. We’ll look at some of those options briefly, but for now, assume we have a function that estimates which examples will tell us the most about our problem.

How Does it Work?

We begin by splitting the dataset into our pool and test sets. These should follow the typical 80-20 breakdown seen in most training problems. The pool is then broken out into training and validation sets. We can only select k examples to use in the training set; the rest go to the validation set. We train (ignoring the validation set) then evaluate our model on the test set. So far so good.
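The split described above can be sketched in a few lines. This is only an illustration: the function name `split_pool_and_test`, the default `k`, and the fixed seed are assumptions made for the example, not part of any particular library.

```python
import random

def split_pool_and_test(examples, test_fraction=0.2, k=10, seed=0):
    """Hold out a test set, then carve k initial training items out of the pool."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]           # the 20% held out for evaluation
    pool = shuffled[n_test:]           # the remaining 80%

    train = pool[:k]                   # only these are sent to the oracle at first
    validation = pool[k:]              # everything else stays unlabeled
    return train, validation, test

train, validation, test = split_pool_and_test(range(100), k=10)
print(len(train), len(validation), len(test))
```

Note that the validation set here is simply "the pool minus the training set", which is why it can be far larger than the training set.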

This is where things get weird. Instead of getting labels for both the validation and training sets, we only get labels for training set items. There are a couple of reasons for this, the most obvious being that we don’t have to consult our oracle as much for labels. The other reason is we won’t actually use the labels during validation.

In this context, validation is a phase in which our algorithm attempts to predict the labels of the validation examples and outputs a value for how confident it was in its decision. We won’t worry too much about how this confidence is measured — just know it’s a continuous variable that measures decision confidence.
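The article leaves the confidence measure open. One simple possibility, shown here purely as an assumption, is to take the largest predicted class probability as the confidence score. The function name `predict_with_confidence` and the input format (a list of class probabilities for one example) are hypothetical.

```python
def predict_with_confidence(probabilities):
    """Return (predicted class index, confidence) for one example.

    `probabilities` is assumed to be the model's predicted probability
    for each class, e.g. [0.1, 0.7, 0.2].
    """
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, probabilities[best]

label, confidence = predict_with_confidence([0.1, 0.7, 0.2])
print(label, confidence)
```

A score near 1 means the model is sure of its answer; a score near 1 divided by the number of classes means it is close to guessing.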

We run our informativeness measure against our validation set and select the k examples that the model was most uncertain about. Then we remove those examples from the validation set and drop them into the training set. Once they’re in our training set, we ask our oracle for their labels. Afterward, we retrain our model on the enlarged training set and repeat until we are satisfied with our test performance.
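To make the loop concrete, here is a toy sketch under heavy assumptions: the data are single numbers between 0 and 1, the `oracle` is a made-up rule standing in for a human annotator, and the "model" is a nearest-class-mean classifier written for this example. None of these names come from a real library.

```python
import random

def oracle(x):
    """Stand-in for the human annotator (hypothetical rule: 1 when x >= 0.5)."""
    return 1 if x >= 0.5 else 0

def fit(train_x, train_y):
    """Toy model: remember the mean value of each labeled class."""
    means = {}
    for label in set(train_y):
        values = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict(model, x):
    """Return (label, confidence), with confidence between 0.5 and 1."""
    if len(model) == 1:                      # only one class seen so far
        return next(iter(model)), 0.5
    d0, d1 = abs(x - model[0]), abs(x - model[1])
    if d0 + d1 == 0:
        return 0, 0.5
    p1 = d0 / (d0 + d1)                      # closer to class 1 mean -> higher p1
    return (1, p1) if p1 >= 0.5 else (0, 1 - p1)

def active_learning_loop(pool, test, k=5, rounds=4, seed=0):
    rng = random.Random(seed)
    validation = list(pool)
    rng.shuffle(validation)

    # Initial training set: k items, labeled by the oracle.
    train_x, validation = validation[:k], validation[k:]
    train_y = [oracle(x) for x in train_x]

    history = []
    for _ in range(rounds):
        model = fit(train_x, train_y)        # train, ignoring the validation set

        # Evaluate on the held-out test set (its labels are assumed known).
        correct = sum(predict(model, x)[0] == oracle(x) for x in test)
        history.append((len(train_x), correct / len(test)))

        if not validation:
            break

        # Rank unlabeled items by confidence, least confident first.
        validation.sort(key=lambda x: predict(model, x)[1])
        chosen, validation = validation[:k], validation[k:]

        # Only now do we pay the oracle for labels.
        train_x += chosen
        train_y += [oracle(x) for x in chosen]
    return history

rng = random.Random(1)
points = [rng.random() for _ in range(100)]
for n_labeled, accuracy in active_learning_loop(points[20:], points[:20]):
    print(n_labeled, accuracy)
```

Each printed row shows how many labels had been purchased from the oracle and the test accuracy at that point, which is the signal used to decide when to stop.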

How do we Measure Informativeness?

Informativeness is usually quantified in terms of uncertainty: the most informative examples are the ones the model has the most difficult time classifying.

One of the most popular ways to handle this problem is using information entropy (IE). IE measures how much information an outcome from a stochastic source carries. In active learning, developers typically compute it over the model’s predicted class probabilities for each example: the more evenly the probability is spread across the classes, the higher the entropy and the less certain the model is about that example.
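As a rough sketch of entropy-based selection, the snippet below computes the Shannon entropy of each example's predicted class probabilities and picks the k examples with the highest entropy. The function names and the input format (one list of class probabilities per example) are assumptions for illustration.

```python
import math

def entropy(probabilities):
    """Shannon entropy (in bits) of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def entropy_sampling(predictions, k):
    """Return the indices of the k examples with the highest entropy."""
    order = sorted(range(len(predictions)),
                   key=lambda i: entropy(predictions[i]),
                   reverse=True)
    return order[:k]

predictions = [
    [0.9, 0.1],   # confident
    [0.5, 0.5],   # maximally uncertain for two classes
    [0.6, 0.4],   # fairly uncertain
]
print(entropy_sampling(predictions, k=2))
```

For two classes the entropy peaks at 1 bit when the model assigns 50% to each, so an even split like the second example is always ranked first.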

There are a number of other methods that developers can leverage based on the requirements of their problem, including random and margin sampling. It’s up to the developer to decide which measure to use in their application.
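For comparison, here is a minimal sketch of the two alternatives just mentioned. Margin sampling is taken here to mean picking the examples with the smallest gap between the two most probable classes, and random sampling simply picks indices uniformly, which makes it a useful baseline. The function names and input format are assumptions for this example.

```python
import random

def margin(probabilities):
    """Gap between the two most probable classes; small means uncertain."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]

def margin_sampling(predictions, k):
    """Return the indices of the k examples with the smallest margin."""
    order = sorted(range(len(predictions)), key=lambda i: margin(predictions[i]))
    return order[:k]

def random_sampling(predictions, k, seed=0):
    """Baseline: return k distinct indices chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(len(predictions)), k)

predictions = [
    [0.90, 0.05, 0.05],   # large margin, confident
    [0.40, 0.35, 0.25],   # tiny margin between the top two classes
    [0.50, 0.10, 0.40],   # small margin
]
print(margin_sampling(predictions, k=2))
print(random_sampling(predictions, k=2))
```

If an uncertainty-based strategy does not beat the random baseline on the test set, the extra machinery is probably not paying for itself.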

Wish you knew more? At ODSC West 2018, Figure Eight’s VP of Machine Learning Jennifer Prendki will give a quick rundown of active learning. Prendki will go into more depth on different setups for these problems and considerations for each situation.

Spencer Norris, ODSC

Spencer Norris is a data scientist and freelance journalist. He currently works as a contractor and publishes on his blog on Medium: https://medium.com/@spencernorris