As newer fields emerge within data science and the research can be hard to keep up with, sometimes it’s best to talk to the experts and pioneers of the field. Recently, we spoke with Alex Ratner, co-founder and CEO at Snorkel AI and an Assistant Professor of Computer Science at the University of Washington, ahead of his upcoming ODSC West 2022 talk on data-centric AI. You can listen to the full Lightning Interview here, and read the transcript of the first few questions with Alex Ratner below.
Q: What is data-centric AI?
Alex Ratner: A lot of our focus has been on what often blocks teams, which is the training data set – the big labeled data sets that the model fits to and learns from. The goal of a standard supervised machine learning model is to fit, but not overfit, a bunch of labeled data. What we mean by data-centric AI is a hypothesis that the average developer is going to need to spend more time on, and get more leverage out of, iterating on the data than iterating on the model. It’s not strictly an either-or; it’s a thesis that you’re going to get more leverage out of improving your data set – labeling more data, improving its quality, improving the way you’re slicing or sampling or augmenting it, etc. – rather than picking a slightly different transformer model.
This is all credit to the way model development has taken off over the last couple of years. But most often today, the data is where you get the leverage. So it’s a hypothesis about that, and it’s a set of techniques that help you do this development of the data. For us, we largely focus on the training data sets, but I think it’s a very exciting area to explore because there’s so much still open once you start looking at testing and evaluation data sets as well.
Thinking about other aspects of how the data is developed and processed is super interesting. My co-founder and former advisor Chris Ré started putting out some notes about test set engineering, because a lot of people are used to “oh, the test set is something I download from a benchmark, and then a score gets spit out at the end that tells me how well I did.” But actually, even engineering the data sets that you’re using for evaluation is hugely important. You need to make sure that they’re sampled in a way that reflects your deployment environment, that they’re refreshed, that they’re monitored, etc. I absolutely think it’s more expansive than just the training set, although that’s often what we focus on because that’s the zero-to-one blocker to actually getting a model built.
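To make the test-set-engineering point concrete, here is a toy sketch (not from the interview, and not any particular library’s API) of evaluating a model per slice of the test set rather than with one aggregate score, so regressions in deployment-critical subsets show up. The slice names and predicates are hypothetical.

```python
# Toy sketch of slice-based evaluation ("test set engineering").
# Instead of one aggregate metric, we score named slices of the test
# set defined by boolean predicates over the raw examples.
from typing import Callable

def accuracy(preds, labels):
    """Fraction of predictions matching labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def slice_report(examples, preds, labels, slices: dict[str, Callable]):
    """Score overall and per-slice; a slice is any boolean predicate."""
    report = {"overall": accuracy(preds, labels)}
    for name, member in slices.items():
        idx = [i for i, x in enumerate(examples) if member(x)]
        if idx:  # skip empty slices
            report[name] = accuracy([preds[i] for i in idx],
                                    [labels[i] for i in idx])
    return report

examples = ["short", "a much longer document here", "mid text", "tiny"]
preds    = [1, 0, 1, 0]
labels   = [1, 1, 1, 0]
report = slice_report(examples, preds, labels,
                      {"long_docs": lambda x: len(x.split()) > 3})
print(report)  # {'overall': 0.75, 'long_docs': 0.0}
```

The aggregate score looks fine here, but the hypothetical `long_docs` slice reveals the model fails on exactly the inputs it might see most in deployment – the kind of gap a single benchmark number hides.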
Q: Why is data-centric AI getting so much attention now?
Alex Ratner: I think there’s a technological aspect and a cultural aspect. On the technological side, it really has to do with the tremendous explosion in progress and convergence around machine learning models. To start on the academic side: if you went to any kind of ML, NLP, computer vision, or applied conference and looked at the average paper, it was about specific models or specific types of features that did really well on specific tasks. If you wanted to solve some problem X in industry, you would have to build a specific, specialized model for it. The model was the blocking point, and it was where the differentiation in how well your solution worked came from. Now, we’ve had this tremendous progress toward models becoming not only more powerful and more automated, but also more convergent. Lately, everything is a transformer, and then we move on to the latest variants. If you look at state-of-the-art results, the models are very, very similar across a huge range of applications and data sets.
So you start to hold the model fixed: on many problems you’re often not going to get big gains from tweaking the model architecture, and you may not even be able to in practice because these models are so massively black-box. The trend toward everyone using these more powerful but less modifiable, much more standardized models – which are also more data-hungry, so they need all this labeled data – means the interface point for actually changing or editing something becomes the data, because you can’t really do it through the model when it’s fairly fixed. We’ve gone from the data being viewed as fixed and the model being the thing you iterate on, to the model being fixed and the data necessarily becoming the thing you iterate on.
The technological aspect, then, comes from this big driver of model progress and convergence leading to much more data-hungry and less editable models. The data has to become the interface – it’s both where you get stuck and where you actually develop things. Now we’re trying to catch up and build the tool sets and platforms that support this new central pain point and workflow.
The cultural aspect is that data engineering – however you slice up and name the parts of the pipeline – has almost always been swept under the rug and treated like a second-class citizen in ML development.
The fancy ML models are the things we get taught about in our courses, that seemed really exciting – at least up until the last couple of years – and that all the fancy machine learning papers get published about. I think there’s an exciting sense of relief for many practitioners that dealing with the data is once again being recognized as important and worthy of support, study, and formalization.
This idea that data is important is not totally new. I love an example from a decade-plus ago, when people were working on the old-school version of language models. They had all these sophisticated models they had built, and then Google came up with one called “stupid backoff,” just trained it on 10x or 100x the amount of data, and it blew all the other ones away.
This idea that you can beat a fancier model with just more and better data is not fundamentally new, but given these recent trends, the pendulum has swung back to that being the reality, and now we’re trying to build around it.
More on Alex Ratner’s ODSC West Session on Data-Centric AI:
Data-centric AI broadly describes the idea that *data*, rather than models, is increasingly the crux of success or failure in AI for many settings and use cases. More specifically, data-centric AI defines ML development workflows that center around principally iterating on the *training data* – e.g., labeling, sampling, slicing, augmenting, etc. – rather than the model architecture. In this talk, I’ll describe how programmatic or weak supervision can not only facilitate these data-centric workflows (in ways that manual labeling cannot), but more importantly, will present an overview of how it can serve as an API for rich organizational knowledge sources, presenting recent technical results and user case studies.
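To illustrate the programmatic supervision idea the abstract refers to, here is a minimal, self-contained sketch. It is not the Snorkel API – just a toy majority-vote combiner showing the core pattern: domain knowledge is encoded as small labeling functions, and their noisy, possibly conflicting votes are aggregated into training labels (real systems learn a model of each function’s accuracy instead of a simple majority vote). All function names and heuristics here are invented for the example.

```python
# Toy sketch of programmatic ("weak") supervision -- NOT the Snorkel API.
# Heuristic labeling functions vote on each example; votes are combined
# by simple majority into weak training labels.
from collections import Counter

ABSTAIN, NEG, POS = -1, 0, 1

# Each labeling function encodes one heuristic and may abstain.
def lf_contains_great(text: str) -> int:
    return POS if "great" in text.lower() else ABSTAIN

def lf_contains_terrible(text: str) -> int:
    return NEG if "terrible" in text.lower() else ABSTAIN

def lf_exclamation(text: str) -> int:
    return POS if text.endswith("!") else ABSTAIN

LFS = [lf_contains_great, lf_contains_terrible, lf_exclamation]

def weak_label(text: str) -> int:
    """Aggregate labeling-function votes by majority; abstain if none fire."""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

docs = ["A great movie!", "Terrible pacing.", "It was fine."]
weak_labels = [weak_label(d) for d in docs]
print(weak_labels)  # [1, 0, -1]
```

The point of the data-centric workflow is that you iterate on these functions – adding, fixing, or reweighting heuristics – rather than hand-relabeling examples or tweaking the model architecture.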