Article by Scott Donohoo and Setu Chokshi of Microsoft.
MLOps means different things to different people. However, the fundamental essence of MLOps is to deliver models into production faster with a consistent, repeatable, and reliable approach. Machine Learning Operations (MLOps) is key to accelerating how data scientists and ML engineers can impact organizational needs. A well-implemented MLOps process not only reduces the time from testing to production, but also provides ownership, lineage, and historical information of ML artifacts being used across the team.
A data science team now needs CI/CD skills to sustain ongoing inference with retraining cycles and automated redeployments of models. Unfortunately, many ML professionals are still forced to run many production tasks manually, which introduces risk and reduces the time they can spend adding value to the business.
Based on our experience working with large and small customers across the world, and bringing together the best practices from that work, Microsoft has developed a solution accelerator to do exactly what the name suggests: accelerate our customers’ journey to production.
Key pillars of this solution accelerator are simplicity and segregation of duties: our intention is that Data Scientists, ML Engineers, and IT teams don’t need significant upskilling before they can do MLOps. For example, Data Scientists just need to focus on the training and inferencing scripts, and MLOps will “just work” for them, provided they follow the pattern laid out by the accelerator. While we enable the division of work between ML Engineers and Data Scientists, the accelerator unifies the components in a simple way that makes MLOps easy for both roles to understand and implement.
The MLOps V2 solution accelerator allows AI professionals to deploy an end-to-end standardized and scalable machine learning lifecycle across multiple workspaces. By abstracting agnostic infrastructure in an outer loop, the customer can focus on the inner loop development of their use cases.
The accelerator project team collected and evaluated over twenty existing MLOps assets, including codebases, scalable capacity, and customer requirements, to redefine the next generation of MLOps at Microsoft. Many older MLOps approaches included customer-specific, unscalable, or outdated scenarios that did not provide technology modularity. With MLOps v2, we are moving Classical Machine Learning, Natural Language Processing, and Computer Vision workloads onto a newer, faster, and more scalable foundation for our customers. The solution accelerator is modular, allowing for the inclusion of building blocks such as Responsible AI and a Feature Store.
Overall, the MLOps v2 solution accelerator serves as the starting point for MLOps implementation in Azure. Solution accelerators enable customers to bootstrap projects that get them 80% of the way there while allowing for adaptability and customization for each unique project. MLOps v2 is a set of repeatable, automated, and collaborative workflows with best practices that empower teams of machine learning professionals to quickly and easily get their machine learning models deployed into production. You can learn more about MLOps here:
One of the main challenges in adopting MLOps effectively is that customers struggle to stand up an end-to-end MLOps engine because of resource, time, and skill constraints. To complicate matters, these skills often come from disparate roles across multiple organizations, each with its own distinct set of tools and enterprise ownership. Common customer challenges with MLOps include the various elements below:
MLOps V2 Solution Accelerator
MLOps v2 provides a templatized approach to these challenges in deploying an end-to-end Data Science process and focuses on driving efficiency for each of the following.
- Bring together the work of the organization in a project repository organized by role.
- Each time changes are committed, work is automatically built and tested, and bugs are detected faster.
- Code, data, models, and training pipelines are shared to accelerate innovation.
- Provide templates to bootstrap the infrastructure and model development environment, expressed as code.
- Automate the entire process from code commit to production.
- Monitor pipelines, infrastructure, and products in production and know when they aren’t behaving as expected.
The goals of the solution accelerator are:
- Enterprise readiness
The MLOps v2 accelerator consists of multiple templates based on pattern architectures that can be reused to establish a “cookie-cutter” approach for the bootstrapping process, shortening it from days to hours or minutes. The bootstrapping process encapsulates key MLOps decisions such as the components of the repository, the structure of the repository, the link between model development and model deployment, and technology choices for each phase of the Data Science process.
Architecturally, a header repository leverages template repositories to drive the deployment of individual technical patterns. The solution accelerator repositories are broken down by technology pattern, as illustrated below.
- A header MLOps V2 repository that serves as a project factory to bootstrap new MLOps projects with the patterns you select.
- An MLOps Templates repository that deploys base MLOps pipelines using your selected CI/CD infrastructure and tools.
- A Project Templates repository that deploys AzureML resources and infrastructure to support the desired ML scenario.
Figure 1: Repository Architecture of MLOps V2
The overall MLOps architectural pattern that the solution accelerator deploys is made up of four broad elements: Data Estate, Administration & Setup, Model Development, and Model Deployment. These represent the high-level phases of the MLOps lifecycle for a given data science scenario. The pattern also captures the relationships and process flow between those elements and the personas associated with ownership of each.
Figure 2, below, illustrates the MLOps V2 architecture for a Classical Machine Learning scenario on tabular data along with an explanation for each of the main elements as well as the component elements within and between them.
1. Data Estate
This element illustrates the data estate of the organization, and potential data sources and targets for a data science project. Data engineers are the primary owners of this element of the MLOps v2 lifecycle. The Azure data platforms in this diagram are neither exhaustive nor prescriptive. The data sources and targets that represent recommended best practices based on the customer use case are indicated by a green check mark.
2. Administration & Setup
This element is the first step in the MLOps v2 solution accelerator deployment. It consists of all tasks related to the creation and management of resources and roles associated with the project. These can include the following tasks, and perhaps others:
* Creation of project source code repositories
* Creation of Machine Learning workspaces by using Bicep, ARM, or Terraform
* Creation or modification of datasets and compute resources that are used for model development and deployment
* Definition of project team users, their roles, and access controls to other resources
* Creation of CI/CD pipelines
* Creation of monitors for collection and notification of model and infrastructure metrics
The primary persona associated with this phase is the infrastructure team, but the phase also includes input from data engineers, machine learning engineers, and data scientists.
3. Model development (inner loop)
The inner loop element consists of the typical iterative data science workflow that acts within a dedicated, secure Machine Learning workspace. A common workflow pattern is illustrated in the diagram. It proceeds from data ingestion, exploratory data analysis, experimentation, model development and evaluation, to registration of a candidate model for production. This modular element as implemented in the MLOps v2 accelerator is agnostic and adaptable to the process your data science team uses to develop models.
Personas associated with this phase include data scientists and machine learning engineers.
4. Machine Learning registries
After the data science team develops a model that is a candidate for deploying to production, the model can be registered in the Machine Learning workspace registry. Continuous integration (CI) pipelines that are triggered, either automatically by model registration or by gated human-in-the-loop approval, promote the model and any other model dependencies to the model deployment phase.
Personas associated with this stage are typically machine learning engineers.
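The promotion step described above can be sketched in a few lines of Python. This is only an illustration of the gating logic, not code from the accelerator: the names `Trigger`, `RegisteredModel`, `should_promote`, and `assets_to_promote` are hypothetical, and a real pipeline would express the same decision in its CI tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Trigger(Enum):
    """The two ways the text says a CI pipeline can be triggered (names are hypothetical)."""
    AUTO_ON_REGISTRATION = "automatic, on model registration"
    HUMAN_APPROVAL = "gated human-in-the-loop approval"


@dataclass(frozen=True)
class RegisteredModel:
    """A candidate model plus the dependencies promoted alongside it."""
    name: str
    version: int
    dependencies: Tuple[str, ...] = ()


def should_promote(trigger: Trigger, approved: bool = False) -> bool:
    """Return True when the pipeline may promote the model to deployment."""
    if trigger is Trigger.AUTO_ON_REGISTRATION:
        # Registration itself is the trigger; no approval is required.
        return True
    # Gated path: promotion waits for an explicit human approval.
    return approved


def assets_to_promote(model: RegisteredModel) -> List[str]:
    """List the model and its dependencies, which move to deployment together."""
    return [f"{model.name}:{model.version}", *model.dependencies]
```

With a gated trigger, `should_promote(Trigger.HUMAN_APPROVAL)` stays false until an approver sets `approved=True`, which mirrors the human-in-the-loop option.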
5. Model deployment (outer loop)
The model deployment or outer loop phase consists of pre-production staging and testing, production deployment, and monitoring of model, data, and infrastructure. Continuous deployment (CD) pipelines manage the promotion of the model and related assets through production, monitoring, and potential retraining, as criteria that are appropriate to your organization and use case are satisfied.
Personas associated with this phase are primarily machine learning engineers.
6. Staging and test
The staging and test phase can vary with customer practices but typically includes operations such as retraining and testing of the model candidate on production data, test deployments for endpoint performance, data quality checks, unit testing, and responsible AI checks for model and data bias. This phase takes place in one or more dedicated, secure Machine Learning workspaces.
7. Production deployment
After a model passes the staging and test phase, it can be promoted to production by using a human-in-the-loop gated approval. Model deployment options include a managed batch endpoint for batch scenarios or, for online, near-real-time scenarios, either a managed online endpoint or Kubernetes deployment by using Azure Arc. Production typically takes place in one or more dedicated, secure Machine Learning workspaces.
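As a rough sketch of the choices described above, the following Python maps a scenario to one of the three deployment options and models the approval gate. The function and enum names are hypothetical and exist only for illustration; they are not part of the accelerator or of any Azure SDK.

```python
from enum import Enum


class DeploymentTarget(Enum):
    """The deployment options named in the text."""
    MANAGED_BATCH_ENDPOINT = "managed batch endpoint"
    MANAGED_ONLINE_ENDPOINT = "managed online endpoint"
    KUBERNETES_VIA_AZURE_ARC = "Kubernetes deployment by using Azure Arc"


def choose_target(scenario: str, use_kubernetes: bool = False) -> DeploymentTarget:
    """Pick a deployment target for a 'batch' or 'online' scenario."""
    if scenario == "batch":
        return DeploymentTarget.MANAGED_BATCH_ENDPOINT
    if scenario == "online":
        # Online, near-real-time scenarios have two options.
        if use_kubernetes:
            return DeploymentTarget.KUBERNETES_VIA_AZURE_ARC
        return DeploymentTarget.MANAGED_ONLINE_ENDPOINT
    raise ValueError(f"unknown scenario: {scenario!r}")


def promote_to_production(passed_staging: bool, approved_by_human: bool) -> bool:
    """Promotion requires both a passing staging phase and a human approval."""
    return passed_staging and approved_by_human
```

For example, `choose_target("online", use_kubernetes=True)` selects the Azure Arc option, while a model that passed staging but has not been approved is held back by `promote_to_production(True, False)`.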
8. Monitoring
Monitoring in staging, test, and production makes it possible for you to collect metrics for, and act on, changes in performance of the model, data, and infrastructure. Model and data monitoring can include checking for model and data drift, model performance on new data, and responsible AI issues. Infrastructure monitoring can watch for slow endpoint response, inadequate compute capacity, or network problems.
9. Data and model monitoring: events and actions
Based on criteria for model and data matters of concern, such as metric thresholds or schedules, automated triggers and notifications can implement appropriate actions. One such action is regularly scheduled automated retraining of the model on newer production data, followed by a loopback to staging and test for pre-production evaluation. Another is a trigger on model or data issues that requires a loopback to the model development phase, where data scientists can investigate and potentially develop a new model.
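A minimal sketch of this event-to-action mapping is shown below. The drift score, the `0.2` threshold, and the 30-day retraining schedule are assumptions chosen for illustration, and the names `ModelMonitorReading`, `Action`, and `decide_action` are hypothetical; appropriate criteria depend on your organization and use case.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The outcomes described in the text, plus a no-op."""
    NONE = "no action"
    RETRAIN_AND_RESTAGE = "retrain on newer production data, then loop back to staging and test"
    RETURN_TO_MODEL_DEVELOPMENT = "loop back to model development for investigation"


@dataclass(frozen=True)
class ModelMonitorReading:
    """One monitoring observation (fields are illustrative assumptions)."""
    drift_score: float        # higher means more model or data drift
    days_since_training: int  # age of the deployed model


def decide_action(
    reading: ModelMonitorReading,
    drift_threshold: float = 0.2,   # assumed metric threshold
    retrain_every_days: int = 30,   # assumed retraining schedule
) -> Action:
    """Map a monitoring reading to one of the loopback actions."""
    if reading.drift_score > drift_threshold:
        # A model or data issue: data scientists need to investigate.
        return Action.RETURN_TO_MODEL_DEVELOPMENT
    if reading.days_since_training >= retrain_every_days:
        # Scheduled retraining, followed by pre-production evaluation.
        return Action.RETRAIN_AND_RESTAGE
    return Action.NONE
```

The drift check is evaluated first so that a model with a data issue goes back to development rather than being retrained automatically on the schedule.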
10. Infrastructure monitoring: events and actions
Based on criteria for infrastructure matters of concern, such as endpoint response lag or insufficient compute for the deployment, automated triggers and notifications can implement appropriate actions. These can trigger a loopback to the setup and administration phase, where the infrastructure team can investigate and potentially reconfigure the compute and network resources.
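The infrastructure case can be sketched the same way. The metrics and limits here (a 95th-percentile latency of 500 ms and 90% compute utilization) are assumed values for illustration only, and `InfraReading`, `infrastructure_alerts`, and `needs_admin_loopback` are hypothetical names rather than accelerator APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InfraReading:
    """One infrastructure observation (fields are illustrative assumptions)."""
    p95_latency_ms: float       # endpoint response time, 95th percentile
    compute_utilization: float  # fraction of provisioned compute in use, 0.0 to 1.0


def infrastructure_alerts(
    reading: InfraReading,
    max_latency_ms: float = 500.0,  # assumed latency limit
    max_utilization: float = 0.9,   # assumed utilization limit
) -> List[str]:
    """Return the infrastructure concerns raised by a reading, if any."""
    alerts: List[str] = []
    if reading.p95_latency_ms > max_latency_ms:
        alerts.append("endpoint response lag")
    if reading.compute_utilization > max_utilization:
        alerts.append("insufficient compute")
    return alerts


def needs_admin_loopback(alerts: List[str]) -> bool:
    """Any alert sends the issue back to the setup and administration phase."""
    return len(alerts) > 0
```

Returning the list of alerts, rather than a single flag, gives the infrastructure team the reason for the loopback along with the notification.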
Additional architectures tailored for Computer Vision and Natural Language Processing use cases are available. Information on these architectures is available in the Azure Architecture Center. Architectures for Azure ML + Databricks and IoT (Internet of Things) Edge scenarios are in development.
The MLOps repository contains the best practices that we have gathered to allow users to go from development to production with minimum effort. We have also made sure that the accelerator is not locked into any single technology stack or tied to hard-coded examples. At the same time, we have tried to keep the examples easy to work with and to extend where needed. In the MLOps repository, you will find the following matrix of technologies in the stack. Users can pick any combination of items in the stack to serve their needs.
Table: Items marked in blue are currently available in the repository
We have also included some steps to demonstrate how you can integrate GitHub Advanced Security, including code scanning and dependency scanning, into your workflows. We plan to add more security features that you can take advantage of in your workflows.
To get data scientists and machine learning engineers familiar with the DevOps concepts and technologies used in the MLOps v2 accelerator, hands-on content on how to automate model training and deployment with GitHub Actions is available on Microsoft Learn this summer. Microsoft Learn offers self-paced, hands-on online content to help you become familiar with modern technologies.
To help organizations onboard to the MLOps v2 accelerator, additional learning content that is still in development will be made available on Microsoft Learn.
As this project continues to evolve, the MLOps V2 team will speak over the next month at several events to communicate updates and news to ensure that consumers of MLOps v2 are up to date with the progress Microsoft has made.
MLOps v2 is the de facto MLOps solution for Microsoft going forward. Aligned with the development of Azure Machine Learning v2, MLOps v2 gives you and your customers the flexibility, security, modularity, ease of use, and scalability to take your AI to production fast. MLOps v2 not only unifies Machine Learning Operations at Microsoft, it also sets new standards for any AI workload. Moving forward, MLOps v2 is a must-consume for any AI/ML project redefining your AI journey.
For more information, please join the Webinar “ODSC & Microsoft – Microsoft’s Accelerator for MLOps,” on September 15th: https://app.aiplus.training/courses/Microsofts-Accelerator-for-MLOps.
MLOps V2 Main Header Repository: Azure/mlops-v2: Azure MLOps (v2) solution accelerators. (github.com)
MLOps V2 Project Template Repository: Azure/mlops-project-template (github.com)
MLOps V2 CI/CD Template Repository: Azure/mlops-templates (github.com)
Build the MLOps V2 Demo: mlops-v2/QUICKSTART.md at main · Azure/mlops-v2 (github.com)
Azure Architecture Center: Machine learning operations (MLOps) v2 – Azure Architecture Center | Microsoft Docs
MSLearn Hands-On Learning MLOps Concepts: mslearn-mlops (microsoftlearning.github.io)
Scott Donohoo: I am a Senior Cloud Solution Architect specializing in customer success with machine learning and AI on Azure.
Setu Chokshi: I’m a senior technical leader, innovator, and specialist in machine learning and artificial intelligence. I have led and implemented machine learning products at scale for various companies.