Editor’s note: Jonas Mueller is a speaker for ODSC West this October 30th to November 2nd. Be sure to check out his talk, “How to Practice Data-Centric AI and Have AI Improve its Own Dataset,” there!
Machine learning models are only as good as the data they are trained on. Even with the most advanced neural network architectures, if the training data is flawed, the model will suffer. Data issues like label errors, outliers, duplicates, data drift, and low-quality examples significantly hamper model performance.
That’s why data-centric AI techniques are becoming increasingly popular. Rather than treating model architecture, hyperparameters, and training tricks as the sole drivers of model improvement, data-centric AI uses the model itself to systematically improve the dataset, so that a better version of the model can be produced without any change to the modeling code. And you don’t have to do all of the data curation work manually! New algorithms and software can help you curate your data systematically via automation.
In this post, I’ll give a high-level overview of how AI/ML can be used to automatically detect various issues common in real-world datasets. These techniques are based on years of research from my team, investigating what sorts of data problems can be detected algorithmically using information from a trained model. To put these ideas into practice, I’ll demonstrate the open-source cleanlab library, which is the most popular data-centric AI software today. With one line of Python code, cleanlab allows you to automatically detect common data issues in almost any dataset (image, text, tabular, audio, etc.) using any machine learning model you’ve already trained (sklearn, huggingface, pytorch, LLMs, …). Once detected, these issues can be addressed to produce a higher-quality dataset and in turn, a more reliable model.
Steps to practice data-centric AI
1. Train the initial ML model on the original dataset.
2. Utilize this model to diagnose data issues (via techniques covered here) and improve the dataset.
3. Train the same model on the improved dataset.
4. Try various modeling techniques to further improve performance.
Many data scientists jump straight from Step 1 to Step 4, but you may achieve big gains without any change to your modeling code by applying data-centric AI techniques to the information captured by your initial ML model (which can already reveal a lot about the data). You can keep boosting performance by iterating Steps 2 → 4 (and try to evaluate on cleaned data too).
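The first three steps can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not cleanlab's algorithm: the dataset is synthetic, 15% of the training labels are flipped on purpose to simulate annotation mistakes, and examples are flagged with a simple heuristic (low out-of-sample predicted probability for the given label, with an arbitrary 0.2 threshold).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)

# Synthetic data standing in for a real dataset (assumption for illustration).
X, y_true = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train_true, y_test = train_test_split(
    X, y_true, test_size=0.3, random_state=0
)

# Simulate annotation mistakes by flipping 15% of the training labels.
y_train = y_train_true.copy()
flip = rng.choice(len(y_train), size=int(0.15 * len(y_train)), replace=False)
y_train[flip] = 1 - y_train[flip]

# Step 1: train the initial model on the original (noisy) dataset.
model = LogisticRegression(max_iter=1000)
baseline_acc = model.fit(X_train, y_train).score(X_test, y_test)

# Step 2: use out-of-sample predicted probabilities to diagnose likely label errors.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, method="predict_proba"
)
self_confidence = pred_probs[np.arange(len(y_train)), y_train]
suspect = self_confidence < 0.2  # arbitrary threshold, an assumption of this sketch

# Step 3: train the same model on the improved (filtered) dataset.
cleaned_model = LogisticRegression(max_iter=1000)
cleaned_model.fit(X_train[~suspect], y_train[~suspect])
cleaned_acc = cleaned_model.score(X_test, y_test)

print(f"flagged {suspect.sum()} examples; accuracy {baseline_acc:.3f} -> {cleaned_acc:.3f}")
```

Whether accuracy actually improves depends on the data and the model; the point of the sketch is only that the modeling code is identical in Steps 1 and 3.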
Another way to improve your dataset is simply to collect more annotations/examples. You’d be surprised how often a fancy model that a smart data scientist spent weeks optimizing is beaten by a baseline model from somebody who instead spent a day labeling more data (this is common even within top tech companies). If you properly utilize the information it has captured about your data, your ML model can help decide which data/annotations would be most informative to collect, which helps you make the most of limited resources.
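One common way to let a model suggest what to label next is uncertainty sampling: rank unlabeled examples by the entropy of the model's predicted class probabilities and send the most uncertain ones to annotators first. The sketch below assumes this generic heuristic (it is not necessarily the method covered in the talk), and the function name, probabilities, and label budget are hypothetical.

```python
import numpy as np

def rank_by_uncertainty(pred_probs: np.ndarray) -> np.ndarray:
    """Return example indices ordered from most to least uncertain (highest entropy first)."""
    eps = 1e-12  # avoids log(0) for probabilities that are exactly zero
    entropy = -np.sum(pred_probs * np.log(pred_probs + eps), axis=1)
    return np.argsort(-entropy, kind="stable")

# Hypothetical predicted class probabilities for four unlabeled examples.
unlabeled_pred_probs = np.array([
    [0.98, 0.01, 0.01],  # model is confident
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # model is nearly clueless
])

label_budget = 2  # how many examples we can afford to annotate
to_label = rank_by_uncertainty(unlabeled_pred_probs)[:label_budget]
print(to_label)  # indices of the examples to send for labeling
```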
Getting Started with Cleanlab
Cleanlab is a Python library built specifically for data-centric AI. With just a few lines of code, you can analyze your dataset to find potential problems.
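A minimal sketch of what this can look like with cleanlab's `Datalab` class, assuming cleanlab is installed with its Datalab extras (`pip install "cleanlab[datalab]"`). The toy dataset, the `label` column name, the classifier, and the `audit_dataset` helper are placeholders for illustration; the cleanlab import sits inside the helper so the rest of the sketch runs even where cleanlab is not installed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy tabular data standing in for your dataset (assumption for illustration).
# For images or text, `features` would typically be embeddings from your model.
features, labels = make_classification(n_samples=300, n_features=8, random_state=0)

# Out-of-sample predicted probabilities from whatever classifier you already use.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), features, labels, cv=5, method="predict_proba"
)

def audit_dataset(features, labels, pred_probs):
    """Run a Datalab audit and return its per-example issues table (hypothetical helper)."""
    from cleanlab import Datalab  # lazy import: only needed when the audit is run

    lab = Datalab(data={"label": labels}, label_name="label")
    lab.find_issues(pred_probs=pred_probs, features=features)
    lab.report()             # prints a summary of the issue types detected
    return lab.get_issues()  # pandas DataFrame with one row per example
```

Calling `issues = audit_dataset(features, labels, pred_probs)` runs the audit; the returned table is expected to contain per-example flags and scores for each issue type checked.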
Under the hood, cleanlab runs various algorithms that take in data representations (embeddings) and probabilistic predictions from your ML model, and uses them to estimate various types of issues that are common in real-world datasets.
Simply detecting data issues doesn’t improve your model – you need to address the problems. For some issues like (near) duplicates, the fix may be as simple as removing the extra copies from the dataset.
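As a rough illustration of removing extra copies, the sketch below keeps the first example in each group of near-duplicates, judged by cosine similarity between embeddings. It is a simple quadratic-time heuristic rather than cleanlab's detection algorithm; the function name, the 0.99 threshold, and the toy embeddings are assumptions, and embeddings are assumed to be nonzero vectors.

```python
import numpy as np

def drop_near_duplicates(embeddings: np.ndarray, threshold: float = 0.99) -> np.ndarray:
    """Return indices to keep, dropping rows that nearly duplicate an earlier kept row."""
    # Normalize rows so that dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    keep = []
    for i in range(len(unit)):
        # Keep row i only if it is not too similar to any row already kept.
        if all(similarity[i, j] < threshold for j in keep):
            keep.append(i)
    return np.array(keep)

embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.999, 0.001, 0.0],  # near-duplicate of row 0
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],      # exact duplicate of row 1
])

kept = drop_near_duplicates(embeddings)
print(kept)  # indices of the rows that survive deduplication
```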
For more complex issues like label errors, you can again simply filter out all the auto-detected bad data. For instance, when fine-tuning various LLMs on a text classification task (politeness prediction), this auto-filtering improves LLM performance without any change in the modeling code! Even greater gains can be achieved by correcting the labels of the examples auto-detected to be mislabeled; these gains hold across different LLMs (and more generally across diverse data modalities and ML models).
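The two options, filtering versus correcting, can be contrasted on a toy example. The sketch assumes a simple flagging rule (the model disagrees with the given label and assigns it very low probability; the 0.1 threshold is arbitrary), and it uses the model's own prediction as a stand-in for the corrected label, whereas in practice a human reviewer might confirm each correction.

```python
import numpy as np

# Hypothetical given labels and out-of-sample predicted probabilities.
labels = np.array([0, 1, 1, 0, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.95, 0.05],  # given label is 1, but the model is confident in class 0
    [0.60, 0.40],
    [0.30, 0.70],
])

# Probability the model assigns to each example's given label.
self_confidence = pred_probs[np.arange(len(labels)), labels]
predicted = pred_probs.argmax(axis=1)
is_suspect = (predicted != labels) & (self_confidence < 0.1)  # arbitrary threshold

# Option A: filter out the flagged examples before retraining.
filtered_labels = labels[~is_suspect]

# Option B: correct the flagged labels (model prediction stands in for human review).
corrected_labels = np.where(is_suspect, predicted, labels)

print(filtered_labels, corrected_labels)
```

Filtering shrinks the dataset, while correcting keeps every example, which is one plausible reason correction can yield larger gains when the corrections are accurate.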
My ODSC West 2023 Tutorial on Data-Centric AI
To learn more about the underlying data-centric AI techniques and real case studies, come check out my tutorial at ODSC West 2023. I’ll cover:
- Fundamentals of data-centric AI
- Algorithms to automatically detect data issues like label errors and outliers
- Methods to improve datasets, including how to efficiently collect additional annotations
Through examples and code walkthroughs, you’ll learn exactly how to apply data-centric AI to get the most out of your machine learning projects via techniques you probably never learned in your university courses.
I hope you enjoyed this introduction to cleanlab and data-centric AI. Be sure to check out my talk at ODSC West for a deep dive into these powerful techniques! You can find more details here.
Jonas Mueller is Chief Scientist and Co-Founder at Cleanlab, a software company providing data-centric AI tools to turn unreliable data into reliable models/analytics. Previously, he was a senior scientist at Amazon Web Services developing algorithms that power ML applications at hundreds of the world’s largest companies, and before that completed his PhD in Machine Learning at MIT. He has also helped create the fastest-growing open-source libraries for AutoML and Data-Centric AI.