The current utility and accessibility of machine learning are due in part to the exponential increase in the availability of data over time. While data is abundant, the labels required for specific supervised machine learning tasks can be difficult to obtain. At ODSC West in 2018, Dr. Jennifer Prendki gave an introduction to active learning, a technique that can be used to minimize the time and cost required to build a suitable dataset for supervised learning. Dr. Prendki is currently the VP of machine learning at Figure Eight and has a wealth of experience from a variety of data science roles.
Labeling all the available data can be cost-prohibitive despite the multitude of services that offer human labeling. Dr. Prendki offers two solutions: label faster by using machine learning, and label smarter to maximize the accuracy gain per label.
It may seem like circular logic to use machine learning models to label data for machine learning, but Dr. Prendki explains that a human-model partnership can be used to develop an effective cycle. A variety of models and services exist to label images rapidly, but their accuracy is far from perfect. To ensure accuracy, humans review the model's labels and correct any erroneous records, and the model is then retrained with the newly labeled data. The process starts with enough human-generated labels to train an initial model. The model is then used to label an additional subset of the remaining data, imitating the human labeler. Some of the model-labeled data points will be incorrect, so a human is required to relabel that portion of the automatically labeled data. The loop of retraining the model on the labeled data and correcting erroneous labels continues until the model reaches sufficient accuracy or all the data is labeled.
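The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow presented in the talk: the one-dimensional data, the threshold "model", and the names `human_label`, `train`, `predict`, and `label_with_review` are all hypothetical, and the human reviewer is simulated by a fixed rule that checks every proposed label.

```python
import random

def human_label(x):
    # Stand-in for a human annotator; the 0.5 cutoff is a made-up ground truth.
    return 1 if x > 0.5 else 0

def train(labeled):
    # Toy model: a threshold halfway between the two class means.
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    if not zeros or not ones:
        # Only one class seen so far: fall back to the mean of all points.
        return sum(x for x, _ in labeled) / len(labeled)
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict(threshold, x):
    return 1 if x > threshold else 0

def label_with_review(pool, seed_size=10, batch_size=20):
    # 1. Humans label a small seed set.
    labeled = [(x, human_label(x)) for x in pool[:seed_size]]
    remaining = pool[seed_size:]
    corrections = 0
    while remaining:
        # 2. (Re)train on everything labeled so far.
        threshold = train(labeled)
        # 3. The model proposes labels for the next batch.
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for x in batch:
            proposed = predict(threshold, x)
            # 4. A human reviews the proposal and corrects it if needed.
            reviewed = human_label(x)
            if reviewed != proposed:
                corrections += 1
            labeled.append((x, reviewed))
    return labeled, corrections

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]
labeled, corrections = label_with_review(pool)
print(f"labeled {len(labeled)} points, human corrected {corrections}")
```

This sketch stops only when the pool is exhausted. A real pipeline would likely also track accuracy on a held-out set and stop once it is good enough, and the saving comes from reviewing a proposed label being faster than labeling from scratch.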
Dr. Prendki introduces the idea of labeling smarter by maximizing the ratio of information to data volume. Selective training can reduce the cost of labeling by processing less data, and it can increase model accuracy by ensuring that the model learns from the data points that are key to generalizing beyond the training data. Randomly sampling training data is often considered best practice, but in some cases selectively sampling data will ultimately provide better results and reduce the number of data points that need to be labeled. For example, in the figure below, a model trained by selectively sampling and labeling the blue data points, among others, can outperform a model trained on a simple random sample.
The concept of smart labeling is intuitive if one is familiar with how to deal with imbalanced classes in machine learning. If the predictive accuracy of undersampled classes is important, it is key to focus labeling effort on those classes. However, class imbalance is not the only way to determine which data points offer the most information gain during training. Dr. Prendki suggested that the change in entropy and the model's confidence in its prediction for each data point are key sources of information for determining the most informative records.
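To make the idea of scoring records by informativeness concrete, here is a small sketch of two common uncertainty scores. It rests on assumptions not taken from the talk: it uses the entropy of the model's predicted class probabilities (rather than a change in entropy between training rounds) and a "least confidence" score, and the function names and example probabilities are hypothetical.

```python
import math

def entropy(probs):
    # Shannon entropy of a predicted class distribution (natural log).
    return -sum(p * math.log(p) for p in probs if p > 0)

def least_confidence(probs):
    # 1 minus the probability of the most likely class.
    return 1.0 - max(probs)

def select_most_informative(predictions, k, score=entropy):
    # Rank unlabeled records by uncertainty and return the top k ids.
    ranked = sorted(predictions.items(),
                    key=lambda item: score(item[1]),
                    reverse=True)
    return [record_id for record_id, _ in ranked[:k]]

# Hypothetical model outputs for four unlabeled records (two classes).
predictions = {
    "a": [0.98, 0.02],
    "b": [0.55, 0.45],
    "c": [0.70, 0.30],
    "d": [0.50, 0.50],
}

print(select_most_informative(predictions, k=2))
# -> ['d', 'b']
print(select_most_informative(predictions, k=2, score=least_confidence))
# -> ['d', 'b']
```

Records "d" and "b" are the ones the model is least sure about, so under this heuristic they would be sent to a human labeler first, while the confidently predicted record "a" would be left for later or skipped.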
- Active learning (AL) is a semi-supervised approach, in that it leverages both labeled and unlabeled data.
- AL can minimize the amount of time and resources spent labeling a dataset for a specific task.
- AL can improve prediction accuracy by ensuring that the training dataset maximizes the ratio of information gain to data volume.