We typically meet an organization’s data science team after they’ve carried out a successful proof of concept. The algorithm they built or acquired produced results that were promising enough to greenlight development of a production ML system.
It’s at this point that the immaturity of ML project management often comes to the fore. In our experience, there are five obstacles that dim the glow of the POC experience.
1. The raw data they’ll use to train the algorithm is not in a usable form
Most of our clients have their own data. In fact, they have lots and lots of data. The problem is that their data is disorganized, unclean, and often stored in a variety of formats and locations.
Data science teams deal with this all the time, and it is a well-documented source of their professional dissatisfaction. Projects often stall here, as the team is forced to get their data into a form that allows it to be enriched as training data.
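Getting heterogeneous data into a single usable form usually means normalizing it against one schema. The sketch below is a hypothetical illustration (the field names, formats, and `normalize` helper are invented for the example): the same kind of record arrives as CSV and as JSON lines with inconsistent key casing and stray whitespace, and is mapped onto one consistent shape.

```python
import csv
import io
import json

# Hypothetical example inputs: the same records in two formats,
# with inconsistent field names, casing, and whitespace.
csv_data = "ID,Name,label\n1, Alice ,SPAM\n2,Bob,ham\n"
jsonl_data = '{"id": 3, "name": "Carol", "Label": "Ham"}\n'

def normalize(record):
    """Map a raw record onto one schema: lowercase keys, trimmed values."""
    clean = {k.strip().lower(): str(v).strip() for k, v in record.items()}
    return {
        "id": int(clean["id"]),
        "name": clean["name"],
        "label": clean["label"].lower(),
    }

records = [normalize(r) for r in csv.DictReader(io.StringIO(csv_data))]
records += [normalize(json.loads(line)) for line in jsonl_data.splitlines()]

print(records)  # three records, one consistent schema
```

In practice the messy part is writing (and maintaining) the `normalize` step for each source; the consolidation pattern itself stays the same.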
2. The team’s training data preparation strategy won’t produce enough data
Data scientists fully understand how much training data their project needs. What they underestimate is what it takes to produce that much training data.
In our first meeting with a data science team, it’s quite common to hear them say, “We’ve burned through most of our budget, we’re way behind schedule and our model isn’t close to the confidence level we need.” The team is spending every spare minute labeling and annotating data themselves, usually with inappropriate technology, and there’s no end in sight.
3. The team lacks production software project skills
No surprise here. Data scientists spend years learning how to build algorithms and how to work with data. They aren’t taught anything about project management, task design, workflow curation, or data QA strategy, let alone about standing up production software.
In part this reflects the immaturity of ML ops as a discipline. It also reflects the scarcity of experienced ML software engineers and other more project-focused roles. Still, this is small comfort to the data science team that is trying to perform functions that are outside of their training or experience.
4. The team lacks the infrastructure to ensure training data quality at scale
Small sample sizes and loose confidence criteria make data quality less of an issue during a proof of concept. But when the goal is to stand up a system that delivers a return on investment, the stakes get higher. Teams need much, much more training data, and suddenly quality is of paramount importance. And data at this scale can’t be manually checked for accuracy.
Preparing training data at production scale requires a sophisticated quality assurance plan that weighs multiple labeling passes, consensus decision-making, gold data insertions, and so on. It also requires a technology platform that provides a real-time view of data quality and worker performance.
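To make two of those mechanisms concrete, here is a minimal sketch, with invented item names and labels, of how consensus voting and gold-data checks might work: each item gets several worker labels, a majority vote resolves the label (or routes the item to another pass), and items with known-correct "gold" labels measure worker accuracy.

```python
from collections import Counter

# Hypothetical annotation batch: each item labeled by three workers.
annotations = {
    "item_1": ["cat", "cat", "dog"],
    "item_2": ["dog", "dog", "dog"],
    "item_3": ["cat", "dog", "bird"],
}

# Gold data: items with known-correct labels seeded into the batch.
gold_labels = {"item_2": "dog"}

def consensus(labels, threshold=2):
    """Return the majority label if it meets the vote threshold, else None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= threshold else None

resolved = {item: consensus(labels) for item, labels in annotations.items()}

# Items with no majority would be routed to another labeling pass.
needs_review = [item for item, label in resolved.items() if label is None]

# Gold-data check: compare resolved labels against the seeded answers.
gold_hits = sum(resolved[i] == g for i, g in gold_labels.items())
accuracy = gold_hits / len(gold_labels)
```

A real platform tracks these signals per worker and in real time; the logic above only shows the core of the bookkeeping.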
5. There’s bias in the training data
As countless breathless headlines attest, training data bias is another obstacle to delivering a production ML system. There are several types of bias, each of which can have a different effect on model performance and reliability.
The sources of these biases, and the techniques for mitigating them, are well understood; social scientists learn them as a matter of course. But while data scientists understand the implications of training data bias perfectly well, they aren’t typically trained to detect or correct for it.
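One simple detection technique, sketched below with an invented dataset and an invented `positive_rate` helper, is to compare the rate of the positive label across demographic groups; a large gap flags a potential sampling or labeling bias worth investigating before training.

```python
# Hypothetical training set: records with a group attribute and a label.
training_data = [
    {"group": "A", "label": "approved"},
    {"group": "A", "label": "approved"},
    {"group": "A", "label": "denied"},
    {"group": "B", "label": "denied"},
    {"group": "B", "label": "denied"},
    {"group": "B", "label": "approved"},
]

def positive_rate(records, group, positive="approved"):
    """Share of records in `group` carrying the positive label."""
    in_group = [r for r in records if r["group"] == group]
    hits = sum(r["label"] == positive for r in in_group)
    return hits / len(in_group)

# A large gap between groups is a red flag, not a verdict; it tells the
# team where to look, not what caused the skew.
gap = abs(positive_rate(training_data, "A") - positive_rate(training_data, "B"))
print(f"positive-rate gap between groups: {gap:.2f}")
```

This only catches one kind of bias (label imbalance across groups); other kinds, such as sampling bias in which records were collected at all, need different checks.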
If after reading this you conclude that your organization, too, suffers from ML ops immaturity, you aren’t alone. And you have many options for lifting yourself out of that state. We’ve prepared a Blueprint for Preparing ML Training Data that offers a detailed, checklist approach to getting yourself ready to stand up a production ML system.