Software 2.0 and Snorkel: Beyond Hand-Labeled Data

This ODSC West 2018 talk “Software 2.0 and Snorkel: Beyond Hand-Labeled Data,” presented by Alex Ratner, a Ph.D. student in Computer Science at Stanford University, discusses a new way of effectively programming machine learning systems using what’s called “weaker supervision,” and how it enables domain experts who don’t know anything about machine learning to rapidly and flexibly train machine learning models. 

Ratner also describes Snorkel, a system that targets the emerging training data bottleneck in the so-called Software 2.0 stack (a term for the move to ML-based systems). A good example of that move is how Google’s machine translation group transitioned from its original statistics-based, hand-crafted translation system to one based on a large-scale machine learning model implemented in TensorFlow, the company’s open-source machine learning framework. The original Google translation system ran to around 500,000 lines of code, whereas the neural machine translation system needed only 500.

From “Software 2.0 and Snorkel: Beyond Hand-Labeled Data,” presented by Alex Ratner

The talk explains a theory of learning without labeled data, as well as a host of recent applications in natural language processing, structured data problems, and computer vision. The talk briefly discusses recent extensions of these core ideas to automatically generating data augmentations, synthesizing training data, and learning from multi-task supervision.

The presentation is organized in the following way:

  • Snorkel: a Training Data Management System for Software 2.0. Snorkel enables users to quickly and easily label, augment, and structure training datasets by writing programmatic operators rather than labeling and managing data by hand.
  • Current directions: multi-task supervision with Snorkel MeTaL. If you are training models for, say, 10 different tasks over similar or the same data, you could instead train them all together and share the representations and layers that these networks learn.
  • Putting it all together: a vision for Software 2.0
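The multi-task direction in the outline above, several tasks trained together over a shared representation, can be sketched roughly as follows. This is a minimal illustration in plain Python, not Snorkel MeTaL's architecture: the class `SharedTrunkModel`, the layer sizes, and the task names are all hypothetical, the weights are random, and no training is shown. It only illustrates how one shared layer is reused by ten small task-specific heads.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return an n_out x n_in weight matrix with small random values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x):
    """Linear layer followed by a tanh nonlinearity."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def count_parameters(layer):
    return sum(len(row) for row in layer)


class SharedTrunkModel:
    """Hypothetical multi-task model: one shared layer, one small head per task."""

    def __init__(self, n_features, n_hidden, task_names, seed=0):
        rng = random.Random(seed)
        # The shared layer is reused by every task.
        self.shared = make_layer(n_features, n_hidden, rng)
        # Each task only adds a small head on top of the shared representation.
        self.heads = {name: make_layer(n_hidden, 1, rng) for name in task_names}

    def predict(self, x):
        hidden = apply_layer(self.shared, x)  # computed once per input
        return {name: apply_layer(head, hidden)[0] for name, head in self.heads.items()}

    def n_parameters(self):
        heads = sum(count_parameters(h) for h in self.heads.values())
        return count_parameters(self.shared) + heads


tasks = [f"task_{i}" for i in range(10)]
model = SharedTrunkModel(n_features=100, n_hidden=32, task_names=tasks)
outputs = model.predict([0.1] * 100)  # one score per task

shared_total = model.n_parameters()    # 100*32 shared + 10 heads of 32
separate_total = 10 * (100 * 32 + 32)  # ten independent models of the same shape
print(shared_total, separate_total)    # prints 3520 32320
```

In this toy setting the shared model has roughly a tenth of the parameters of ten separate models, which is one way to read the idea of sharing what the networks learn across tasks.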

One of the key bottlenecks in building machine learning systems today is creating and managing labeled training data sets. The talk demonstrates how, instead of labeling data by hand, users can interact with the modern ML stack by programmatically building and managing training data sets. These weak supervision approaches can lead to applications built in days or weeks, rather than months or years.

Ratner’s work investigates whether users can train models without any hand-labeled training data, instead writing labeling functions, which programmatically label data using weak supervision strategies like heuristics, knowledge bases, or other models. These labeling functions can have arbitrary accuracies and correlations, leading to new systems, algorithmic, and theoretical challenges.
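To make the idea of a labeling function concrete, here is a minimal sketch in plain Python. It does not use the Snorkel API. The keyword lists, function names, and report snippets are all hypothetical and invented for illustration. The votes are combined by simple majority vote, which is a stand-in baseline rather than the noise-aware model the talk describes.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical keyword lists standing in for a heuristic and a small knowledge base.
ABNORMAL_TERMS = {"opacity", "effusion", "pneumothorax"}
NORMAL_PHRASES = ("no acute", "unremarkable", "within normal limits")


def lf_abnormal_term(report):
    """Vote POSITIVE if any abnormal-finding term appears (ignores negation)."""
    words = set(report.lower().replace(".", " ").replace(",", " ").split())
    return POSITIVE if words & ABNORMAL_TERMS else ABSTAIN


def lf_normal_phrase(report):
    """Vote NEGATIVE if a phrase that usually signals a normal study appears."""
    text = report.lower()
    return NEGATIVE if any(p in text for p in NORMAL_PHRASES) else ABSTAIN


def lf_short_report(report):
    """Weak heuristic: assume very short reports describe normal studies."""
    return NEGATIVE if len(report.split()) < 6 else ABSTAIN


LABELING_FUNCTIONS = [lf_abnormal_term, lf_normal_phrase, lf_short_report]


def majority_vote(report):
    """Combine the non-abstaining votes; abstain when there are none or a tie."""
    votes = [lf(report) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    (top, n_top), *rest = counts.most_common()
    if rest and rest[0][1] == n_top:
        return ABSTAIN
    return top


# Made-up report snippets, not real data.
reports = [
    "Large left pleural effusion with adjacent opacity.",
    "Lungs are clear. No acute cardiopulmonary process.",
    "Comparison is made to the prior study from last year.",
    "No pneumothorax or effusion.",
]
labels = [majority_vote(r) for r in reports]
print(labels)  # [1, 0, -1, -1]
```

The last snippet shows why the combination step matters: the keyword heuristic misreads a negated finding and conflicts with another function, so the example gets no label. Functions like these are noisy and can disagree, which is the challenge the paragraph above points to.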

The talk includes an extensive case study on using Snorkel with chest x-rays, a collaboration with the Stanford Radiology Department. Hand-engineering features for this problem used to take years. Using Snorkel, the domain experts defined some high-level heuristics for labeling x-ray reports and reached the same baseline in 1-2 weeks.

To take a deeper dive into the effort to move beyond hand-labeled data, check out Alex Ratner’s compelling talk from ODSC West 2018.

Key Takeaways:

  • By modeling the process of labeling training data sets, we can enable users to generate them in higher-level, faster ways
  • Supervision as the declarative interface to Software 2.0
  • Modeling the noise in these labeling processes, rather than viewing training data sets as static assets, can allow us to get more performance out of data sets and actually create them in higher-level, cheaper, and more efficient ways
  • We can amortize costs and push the boundary even further by taking advantage of multi-task (and massively multi-task) learning
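The takeaway about modeling the noise can be illustrated with a small, simplified sketch. It assumes exactly three labeling sources that vote -1 or +1 on every example, never abstain, make errors independently of each other given the true label, and are each better than chance. Under those assumptions the average agreement between two sources equals the product of their accuracy parameters, so the accuracies can be recovered from agreement rates alone, without any ground-truth labels. The function names and the synthetic data are hypothetical, and this method-of-moments estimate is not necessarily what Snorkel implements.

```python
import math
import random


def simulate_votes(n, accuracies, seed=0):
    """Synthetic votes: each source matches a hidden label with its own probability."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.choice((-1, 1))  # hidden label, discarded after voting
        rows.append([y if rng.random() < p else -y for p in accuracies])
    return rows


def mean_agreement(rows, i, j):
    """Average of vote_i * vote_j: +1 when the two sources agree, -1 when they differ."""
    return sum(r[i] * r[j] for r in rows) / len(rows)


def estimate_accuracies(rows):
    """Recover each source's accuracy from pairwise agreement rates only."""
    m01 = mean_agreement(rows, 0, 1)
    m02 = mean_agreement(rows, 0, 2)
    m12 = mean_agreement(rows, 1, 2)
    # With a_i = 2 * accuracy_i - 1, independence gives m_ij = a_i * a_j.
    a0 = math.sqrt(abs(m01 * m02 / m12))
    a1 = math.sqrt(abs(m01 * m12 / m02))
    a2 = math.sqrt(abs(m02 * m12 / m01))
    return [(1 + a) / 2 for a in (a0, a1, a2)]


def weighted_vote(votes, accuracies):
    """Weight each vote by its log-odds of being correct, then take the sign."""
    score = sum(v * math.log(p / (1 - p)) for v, p in zip(votes, accuracies))
    return 1 if score >= 0 else -1


true_accuracies = [0.9, 0.75, 0.6]  # used only to generate the synthetic votes
rows = simulate_votes(100_000, true_accuracies)
estimated = estimate_accuracies(rows)
print([round(p, 2) for p in estimated])

# One accurate source can outweigh two weaker ones that disagree with it.
print(weighted_vote([1, -1, -1], estimated))
```

The estimates should land close to the accuracies used to generate the data, even though the hidden labels are never consulted. The weighted vote then treats the sources unequally, which a plain majority vote cannot do.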

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.