Paolo is a speaker for ODSC East 2020 this April 13-17. Be sure to check out his talk, “Guided Labeling: Human-in-the-Loop Label Generation with Active Learning and Weak Supervision,” there!
One of the key challenges of applying supervised machine learning to real-world use cases is that most algorithms and models require lots of labeled data. Those labels serve as the target variable when training your predictive model.
How, then, do we make the labeling process more efficient, saving money and time while still getting the labels we need? We can combine active learning with weak supervision in a guided analytics interactive application.
In active learning settings, the human is placed back in the loop to help guide the algorithm. The idea is simple: not all examples are equally valuable for learning, so an active learning sampling strategy picks the most informative examples, and the human provides labels for those so that the algorithm can learn from them. This cycle (or loop) continues until the learned model converges or the user decides to quit the application.
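To make the loop concrete, here is a minimal sketch in Python. The dataset, the `LogisticRegression` model, and the use of a known ground truth in place of a human annotator are all illustrative assumptions, not part of the talk's actual implementation; in a real guided analytics application, the human would supply each requested label interactively.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical toy dataset standing in for a real use case.
X = rng.normal(size=(200, 2))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)  # stands in for the human oracle

# Small seed set with both classes represented
labeled = list(np.where(y_true == 0)[0][:5]) + list(np.where(y_true == 1)[0][:5])
unlabeled = [i for i in range(200) if i not in labeled]

model = LogisticRegression()
for _ in range(5):  # each pass is one turn of the human-in-the-loop cycle
    model.fit(X[labeled], y_true[labeled])
    # Active sampling: ask about the point the model is least sure of
    proba = model.predict_proba(X[unlabeled])
    pick = unlabeled[int(np.argmin(proba.max(axis=1)))]
    labeled.append(pick)     # in a real app, the human labels this point here
    unlabeled.remove(pick)
```

Each iteration retrains on everything labeled so far and then requests exactly one new label, which is what keeps the human's effort focused on informative examples.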
Active learning sampling, the selection of which unlabeled data points to present to the human, is at the core of any active learning strategy. We are going to use two sampling strategies: label density, to explore the feature space, and model uncertainty, to exploit data points near the decision boundary.
Label density compares the overall distribution of the data with the distribution of the already labeled data points. When labeling data points, the user might wonder: "Is this data point representative of the distribution?" and "Are there still many other data points quite similar to the one I just labeled? How do I skip them?" Label density addresses these concerns by ranking the unlabeled data points according to how representative each one is of the unlabeled distribution, while penalizing data points that are too similar to those already labeled in past iterations.
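One way to implement such a ranking is sketched below. The Gaussian-kernel density estimate, the max-similarity penalty, and the `sigma` bandwidth are assumptions made for illustration; the talk's actual scoring may differ.

```python
import numpy as np

def density_rank(X_unlabeled, X_labeled, sigma=1.0):
    """Rank unlabeled points: favor dense regions of the feature space,
    but penalize points too similar to what has already been labeled."""
    def kernel(A, B):
        # pairwise squared distances -> Gaussian similarities in [0, 1]
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    density = kernel(X_unlabeled, X_unlabeled).mean(axis=1)
    penalty = kernel(X_unlabeled, X_labeled).max(axis=1) if len(X_labeled) else 0.0
    return np.argsort(-(density - penalty))  # best candidates first

# Toy example: a dense unlabeled cluster near (5, 5), one isolated point,
# and one point sitting right on top of an already-labeled example.
X_u = np.array([[0.0, 0.0], [5.0, 5.0], [5.1, 5.0], [5.05, 5.05], [20.0, 0.0]])
X_l = np.array([[0.0, 0.0]])
order = density_rank(X_u, X_l)
```

The dense, still-unexplored cluster is ranked first, while the point duplicating an existing label drops to the bottom.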
Model uncertainty is based on the prediction probabilities of the model on the still unlabeled data points. Besides selecting data points based on the overall distribution, we should also prioritize missing labels based on the model's predictions. In every iteration, we can score the data that still needs to be labeled with the retrained model. What can we infer from those predictions? One thing we can do is compute the model's uncertainty for each prediction: uncertainty gives a feeling of where the model needs human input most.
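A common way to turn class probabilities into an uncertainty score is least confidence, sketched below; other choices (e.g. prediction entropy or margin) work similarly. The example probabilities are made up for illustration.

```python
import numpy as np

def uncertainty(proba):
    """Least-confidence uncertainty: 1 minus the top class probability.
    `proba` has shape (n_points, n_classes), rows summing to 1."""
    return 1.0 - proba.max(axis=1)

# A confident and a near-boundary prediction from the retrained model:
proba = np.array([[0.95, 0.05],   # model is sure -> low uncertainty
                  [0.55, 0.45]])  # near the decision boundary -> high
scores = uncertainty(proba)       # ≈ [0.05, 0.45]
```

Ranking the unlabeled points by this score and asking the human about the top ones concentrates labeling effort exactly where the model is least sure.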
Besides active learning, you can also use weak supervision, a powerful technique in which a model is trained from a good set of labeling functions rather than manually assigned labels. Adopting weak supervision speeds up label generation even more, and it becomes even more powerful when controlled interactively via a user interface!
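To show what labeling functions look like, here is a minimal sketch for a hypothetical spam-detection task. The functions, the `ABSTAIN` convention, and the simple majority vote are all illustrative assumptions; production weak supervision systems typically learn to weight the functions rather than counting votes equally.

```python
ABSTAIN = -1  # a labeling function may decline to vote

# Hypothetical labeling functions (1 = spam, 0 = ham):
def lf_contains_offer(text):
    return 1 if "offer" in text.lower() else ABSTAIN

def lf_contains_meeting(text):
    return 0 if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return 1 if text.count("!") >= 3 else ABSTAIN

LFS = [lf_contains_offer, lf_contains_meeting, lf_many_exclamations]

def weak_label(text):
    """Combine labeling-function votes by majority, ignoring abstains."""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

A call like `weak_label("Special offer!!! Click now!!!")` returns the spam label because two functions fire, while text matching no function is left unlabeled, which is exactly where an interactive interface lets the user add or refine functions.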
Join my presentation at ODSC East 2020 to learn how to combine active learning and weak supervision in a single guided analytics application built with the free and open-source KNIME Analytics Platform.
About the ODSC East Speaker/Author:
Paolo Tamagnini is a data scientist at KNIME. He holds a master's degree in data science from the Sapienza University of Rome and has research experience from NYU in data visualization techniques for machine learning interpretability. Follow Paolo on LinkedIn.