In 2015, a machine learning research paper made waves with its discussion of the “Hidden Technical Debt in Machine Learning Systems”. In this paper, Sculley et al. highlighted how the code to build a machine learning model is only a small piece of the entire project. Since then, this notion has been validated throughout the industry as Data Scientists and Machine Learning Engineers attempted to productionize models from Jupyter notebooks with little success.
Figure 1: Elements for ML systems. Adapted from Hidden Technical Debt in Machine Learning Systems. Google Cloud, Public Domain.
In fact, most models built today still do not make it into production, even with the wide range of tools popping up on the market. This domain has taken center stage among machine learning products in recent years. Although I’m optimistic about the future, a glimpse into the vast and varied landscape of MLOps products today can easily leave you daunted. Mihail Eric wrote a great piece discussing this in his blog post, “MLOps Is a Mess But That’s to be Expected”.
MLOps, or Machine Learning Operations, is a set of tools, practices, techniques, and culture that ensures reliable and scalable deployment of machine learning systems. This is one of the most, if not the most, important specializations for Data Scientists to adopt in the coming decade, because realizing the value of machine learning systems in production is crucial for business success.
Figure 2: The disciplines of MLOps.
Machine Learning Project Lifecycle
The ML Project Lifecycle is immensely iterative. Each step of the way can yield new information or insights that require you to take one to two steps back so you can go three steps forward. This iteration is fundamentally what makes the work so difficult; expert machine learning practitioners are phenomenal at experimentation, but when dealing with rapidly changing information and insights it can be challenging to experiment effectively.
A Machine Learning Project falls into four broad stages: Scoping, Data, Modeling, and Deployment.
Let’s cover each of these at a high level.
In scoping, we create the project and assess if the business is ready to realize value from data. This first starts with having a well-articulated problem statement. Do we really know what we’re trying to solve and whom we’re solving it for? Is it something that actually needs to be solved? Do we fully understand the why and the who behind what we are aspiring to do?
After fully understanding the problem and the people we’re solving it for, it’s important to assess resources, constraints, and timelines. Is data actually collected for our use case? It’s great to have moonshot ideas, but machine learning doesn’t happen without data. If data isn’t collected (or maintained with high quality), then that is a significant limitation. It’s better to say no before starting work than to spend months on a “great idea” only to realize the data hasn’t been collected sufficiently.
It’s usually best to start a project document at this stage to identify the key people working on the project, relevant data sources, limitations in data quality, potential ethical considerations, and rough timelines for progress. This document should also outline what exactly you are planning to build: is it a recommendation engine, a prediction model, clustering, etc.? Each has different requirements that should be identified early.
This document should be considered a work in progress, especially with regard to timelines. The highly iterative nature of data work makes it unrealistic to give precise deadline estimates this early in the project. In fact, mismanaged data projects often miss this point first, causing a plethora of missed deadlines and/or high burnout among data scientists.
The last, but certainly not least, thing to heavily consider is the implementation scenario of your model. Is your solution actually ready to be used by stakeholders? Is it meant to live on a website, an app, a standalone table, etc.? How are your users meant to interact with your results? Too many projects get started without consideration for this step, and a project should be stopped if a good answer cannot be provided; the value of a machine learning project can only be realized if people use it.
The first part of this stage can overlap a bit with Scoping as you try to define data sources and establish baselines for what can be done. This can mean a lot of EDA: hunting, gathering, and visualizing work. It’s imperative to identify quality constraints in this step, because if data is not adequately available at a certain level of detail, or is largely null for a subgroup, that can become a massive problem in the later stages of the project.
One of the worst things I’ve seen people do at this stage is not staying in it long enough. Rushing through this stage or doing low-quality work just to get past it (i.e., blindly imputing means to get numerical values) is seriously detrimental. This is ironic because the majority of a machine learning project is data cleaning, yet too many shortcuts are taken just to “get things done”, and that leads to major issues downstream.
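As a small illustration of the kind of check worth doing before moving on, here is a sketch of a null-rate audit, overall and per subgroup (the table, column names, and "region" grouping are all hypothetical):

```python
import pandas as pd

# Hypothetical example data: "region" is the subgroup we worry about.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "income": [52_000, None, None, None, 48_000],
    "age":    [34, 29, 41, None, 57],
})

# Overall null rate per column.
overall = df.isna().mean()

# Null rate per column within each subgroup: a column that looks mostly
# complete overall can still be largely null for one group.
by_region = df.drop(columns="region").isna().groupby(df["region"]).mean()
```

A check like this surfaces the "largely null for a subgroup" problem early, when it is still cheap to fix upstream rather than after a model has been trained on it.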
An unfortunate side effect of this stage is that it can be very unorganized. It involves a lot of scouring and sourcing of high-quality data, which can lead to ineffective code or messy queries. As you wrap up your first pass through this stage, I recommend cleaning up your code and organizing your data in sensible locations. Make sure your labels are well defined, data tables are created with descriptive names, and documentation is provided for how the code can be re-run on new data.
As you’re doing the sourcing, it’s crucial to be mindful of the implementation of your machine learning system. It’s great you can find quality data for training your model, but is it there for new data? How can a system be set up to make sure of that? A balance between fast iteration and robust implementation is difficult to strike, but it is deeply needed. Luckily, you get better at it the more you practice.
Modeling is the “fun” part of machine learning projects, and what most people learn in courses and personal projects. I won’t cover the depths of selecting and training the right models in this series, as there’s a plethora of that content already out there, but it is crucial to know what you’re choosing and why.
I personally don’t believe you need to know every model in full mathematical depth to use it appropriately, but you should understand each model’s strengths and weaknesses in depth so you can see why some work and others fail. This leads into the second part of this stage: error analysis.
This stage can be heavily experimental and iterative, and you should have a keen eye for debugging where models fail and how to solve those failures (and whether that’s worth your time). Oftentimes, model performance needs to be delicately balanced against business and technology costs (resources, compute, latency, throughput, etc.).
Applied machine learning is entering a stage where choosing a strong model is not really the hard part; it’s much more difficult to be strategic in experimenting and debugging. It’s an art to know which strategies or error analysis techniques to try in order to understand what your model is learning and where it is failing (appropriately or not). Having a collaborative team can really help here because, as people gain more experience, they understand how to conduct effective post-hoc analysis of models.
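One common error analysis technique is slicing evaluation results by a feature to find weak segments. A minimal sketch, assuming a hypothetical evaluation frame with invented column names:

```python
import pandas as pd

# Hypothetical evaluation frame: true labels, model predictions, and a
# categorical feature to slice on.
eval_df = pd.DataFrame({
    "label":   [1, 0, 1, 1, 0, 1, 0, 0],
    "pred":    [1, 0, 0, 1, 0, 0, 1, 0],
    "segment": ["new", "new", "new", "returning", "returning",
                "returning", "returning", "returning"],
})

eval_df["correct"] = eval_df["label"] == eval_df["pred"]

# Accuracy per slice: a strong overall number can hide a weak segment.
per_slice = (
    eval_df.groupby("segment")["correct"]
    .agg(accuracy="mean", n="size")
    .sort_values("accuracy")
)
```

Sorting the slices worst-first gives you a concrete starting point for debugging: is the weak segment underrepresented in training, mislabeled, or genuinely harder?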
Deployment is not a step needed for every single model, but it is needed to build a machine learning system, because asking your users or stakeholders to run a Jupyter notebook to see your results will not provide value to them. Usually, models need to be integrated into a website, an app, or some platform the company already uses, and this productionization can be difficult.
Furthermore, the work on this project is not “done” just because the model is in production. Monitoring and system maintenance need to be set up to ensure the model performs effectively over time. Models are powered by data, and data is not a static resource; it evolves and changes over time, which can lead to models deteriorating in production. Causes include data drift (the distribution of features changing), concept drift (the mapping of features to target changing), data quality issues, etc.
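One simple way to sketch a data drift check is a two-sample Kolmogorov-Smirnov test on a single feature. The feature values, sample sizes, and alert threshold below are all illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical feature values: training distribution vs. what the model
# sees in production after the upstream data shifted.
train_feature = rng.normal(loc=0.0, scale=1.0, size=1_000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=1_000)  # mean has drifted

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the two
# samples come from different distributions, i.e. possible data drift.
stat, p_value = ks_2samp(train_feature, live_feature)
drift_detected = p_value < 0.01  # the alert threshold is a judgment call
```

In practice you would run a check like this per feature on a schedule, and treat an alert as a prompt to investigate (and possibly retrain) rather than an automatic failure.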
In addition to model monitoring, there are a plethora of software engineering items to set up in this stage:
- Real-time or Batch
- Cloud or Edge/Browser
- CPU/GPU memory
- Latency, throughput
- Security and privacy
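To make the latency and throughput items concrete, here is a minimal benchmarking sketch; the `predict` stub and its 1 ms sleep are assumptions standing in for a real model:

```python
import time

def predict(batch):
    """Stand-in for a real model's predict call; the sleep simulates
    inference work and is purely an assumption for this sketch."""
    time.sleep(0.001)
    return [0] * len(batch)

def benchmark(batch, n_calls=50):
    """Measure rough per-call latency and throughput for a predict function."""
    start = time.perf_counter()
    for _ in range(n_calls):
        predict(batch)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_calls * 1000   # average milliseconds per call
    throughput = n_calls * len(batch) / elapsed  # rows scored per second
    return latency_ms, throughput

latency_ms, throughput = benchmark(batch=[{"feature": 1.0}] * 32)
```

Numbers like these are what you weigh against the CPU/GPU, cloud-vs-edge, and real-time-vs-batch choices above: a model that is accurate but too slow for its serving budget is still a failed deployment.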
From the first deployment to ongoing maintenance, these are important considerations to be mindful of and difficult to get right. This is a big reason building machine learning systems fails so often: it requires software and data infrastructure that must be in place to fully realize value from modeling projects.
With each machine learning project, there are two projects encompassed inside of it and the second one starts when you deploy the first.
Iteration, Experimentation, and Challenges
Although these steps have a sequential nature to them, each step can yield information that causes you to make different decisions in steps prior. Deploying models can yield new insight into which models work with optimal performance or new data/concept drift checks, modeling can yield new insight into data quality needs or baselines, and much more.
It takes a strong strategic mindset to take vast uncertainty and shed light on crucial pieces for project success. You should approach these projects with a number of hypotheses for why certain ideas would work, why things are going well, why things are failing, where it might be an issue, and more. Experience building systems helps build this judgment incredibly well.
This need for iteration to move things forward and experimentation to test various ideas has to be carefully balanced with performance, timelines, and cost, which is ultimately why this craft can be so challenging. Each step is filled with uncertainty, and the best practitioners know how to navigate those uncertain waters with sound judgment.
Figure 3: Machine Learning Project Lifecycle.
Treveil et al. outline three main reasons for doing MLOps (and having strong MLOps infrastructure), and all of them have been a central focus for many organizations attempting to build machine learning systems.
- Risk Mitigation. An ML model living in your local Jupyter notebook that needs to be run once a month poses minimal risk, but a model living on your website servicing millions of clients daily can easily become a highly risky venture. A centralized team monitoring and maintaining a multitude of these models in production takes on risks such as: the model being unavailable for a period of time, the model returning a bad prediction for a given sample, model accuracy or fairness decreasing over time, the skills necessary to maintain the model (i.e., data science talent) being lost, and open-source software libraries being deprecated or changed significantly. Even with a data governance council, you need true MLOps infrastructure to mitigate these business-critical issues.
- Responsible AI. These days it’s not enough to just build a high-performing model; it also should not unfairly discriminate against different demographics or subgroups. Although machine learning explicitly learns from past examples, it can easily be the case that past examples have human biases baked in. Overcoming these can be quite challenging, but machine learning systems must succeed along two dimensions: intentionality and accountability.
- Scale. MLOps is not just an option for scale, it is a necessity. Among its many benefits: keeping track of versioning, especially for experiments in the design phase; understanding whether retrained models perform better than previous versions (and promoting better models to production); and ensuring, at defined periods (daily, monthly, etc.), that model performance is not degrading in production.
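As one small sketch of the retrained-versus-previous comparison above, a promotion gate might look like the following; the function name, metric values, and margin are all hypothetical:

```python
def should_promote(current_metric: float, challenger_metric: float,
                   min_improvement: float = 0.01) -> bool:
    """Promote the retrained model only if it beats the current production
    model by a margin large enough to matter (guards against noise)."""
    return challenger_metric >= current_metric + min_improvement

# Hypothetical validation AUCs for the live model and its retrained challenger.
assert should_promote(current_metric=0.84, challenger_metric=0.87) is True
assert should_promote(current_metric=0.84, challenger_metric=0.845) is False
```

Real MLOps platforms wrap this kind of gate with experiment tracking and model registries, but the core decision is exactly this comparison, run automatically at each retraining cycle.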
In Introducing MLOps by Treveil et. al., the authors highlight three key reasons that make managing this entire machine learning lifecycle at scale so challenging.
- Many dependencies. Data and business needs are constantly changing, so models in production need to be continually monitored and evaluated to confirm they’re operating in accordance with expectations and still addressing the original problem.
- Many languages. This isn’t just a reference to coding languages, either. There is a wide variety of people involved in this process — business leaders, data scientists/analysts/engineers, machine learning engineers, IT teams, etc. Most of these people use a myriad of tools and concepts, so coordinating across wide enterprises can be challenging.
- Many skillsets. Similar to the previous issue, it can be a massive ask to expect data scientists to know all the ins and outs of software engineering (and vice versa). Although MLOps tools may make this bridge easier to traverse over time, it is currently hard to pinpoint a single role to champion this work without causing burnout. This difficulty naturally causes a lot of turnover and, by consequence, greater complexity when you’re expected to manage models you didn’t create.
I’d add another challenge:
- Lack of robust data infrastructure. Mature data organizations have a data dictionary system, good documentation of what can be found where, example queries people have used for common needs, a quality indicator showing which fields/tables/systems are down or have issues, and much more. Having an easy EDA tool that monitors existing and new data in a rapidly changing environment can be an immense aid in building machine learning systems.
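As an illustration of what a minimal data dictionary could look like, here is a sketch that profiles a table into one entry per column; the table and column names are invented:

```python
import pandas as pd

def build_data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Profile a table into a minimal data-dictionary entry per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_rate": df.isna().mean(),
        "n_unique": df.nunique(),
        # First non-null value as an example, or None if the column is empty.
        "example": df.apply(
            lambda s: s.dropna().iloc[0] if s.notna().any() else None
        ),
    })

# Hypothetical customer table.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "signup_date": ["2021-01-04", None, "2021-03-19"],
    "plan": ["basic", "pro", "pro"],
})
dictionary = build_data_dictionary(customers)
```

Even a generated starting point like this beats having nothing: human-written descriptions and ownership information can then be layered on top of the automated profile.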
In future parts of this series, I’ll cover each section in more depth by clarifying concepts with greater detail and even showing how it can be done via TensorFlow. I’m currently reading some MLOps books on O’Reilly, finishing up this Machine Learning Engineering for Production Specialization by DeepLearning.ai, and practicing this work in my day job so follow along if you’d like to learn more!
Treveil et al., O’Reilly Media, Introducing MLOps
DeepLearning.ai, Coursera, Machine Learning Engineering for Production (MLOps) Specialization
Chip Huyen, Stanford, CS 329S: Machine Learning Systems Design
About the Author: Ani Madurkar
I’m a Senior Data Scientist by day & an artist by night. I’m a person who deeply loves storytelling, and all my passions circle around this point. From Philosophy to Photography to Data Science, I enjoy crafting interesting and insightful stories.
I work for a boutique consulting firm, Fulcrum Analytics, based in New York City and build enterprise Machine Learning systems for top companies and startups. In my spare time, I read lots of books on philosophy, psychology, business strategy, statistics, and more. I love refining my craft to get as close to mastery as I can so I will often have a passion project or two I’m working on, which I try to document and share on this platform and my Twitter.
Outside of projects, reading, and writing, I’m traveling and taking pictures of landscapes. Feel free to check out my website for some of my work (more to be uploaded soon) or my Instagram, @animadurkar, if you’re curious.