Machine Learning Operations (MLOps) can significantly accelerate how data scientists and ML engineers meet organizational needs. A well-implemented MLOps process not only expedites the transition from testing to production but also provides ownership, lineage, and historical data about the ML artifacts used within the team. Data science teams are now expected to have CI/CD skills to sustain ongoing inference with retraining cycles and automated redeployments of models. Yet many ML professionals still run MLOps manually, which reduces the time they can spend adding value to the business.
To address these issues, many individuals and groups have developed their own accelerators and training material to meet their needs or those of their customers. The result was a large number of accelerators, code repositories, and even full-fledged products built with or on top of Azure Machine Learning (Azure ML). We collected and evaluated over 20 MLOps solution accelerators and code bases from across the organization. Over time, a lack of maintenance in some of these popular repositories led to frustrating experiences: the repositories used a wide variety of coding patterns, or relied on examples that replicated only a small portion of a real-life production workload.
Based on our analysis of these accelerators, we identified design patterns and code that we could leverage.
We brought together more than 30 contributors from various countries and functions, including the Azure ML Product and Engineering teams, to align these efforts and develop the MLOps v2 accelerator alongside the development of the Azure Machine Learning v2 platform, CLI, and SDK. We now have a codebase of repeatable, automated, and collaborative workflows and patterns that embody best practices for deploying machine learning models to production.
This helps customers reduce the time it takes to bootstrap a new data science project and get it to production. The accelerator provides reusable templates that establish a cookie-cutter approach across various implementation patterns. It incorporates the key MLOps decisions: the components and structure of the repository, the link between model development and model deployment, and the technology choices for each phase of the data science process. Currently, we support 24 different implementation patterns. The accelerator gives customers flexibility, security, modularity, and scalability, bundled with ease of use, to go from development to production quickly.
Solution Accelerator Architecture
The MLOps v2 architectural pattern is made up of four modular elements representing phases of the MLOps lifecycle for a given data science scenario, the relationships and process flow between those elements, and the personas associated with ownership of those elements. Below is the MLOps v2 architecture for a Classical Machine Learning scenario on tabular data along with an explanation for each element.
Figure 1: Classical Machine Learning MLOps Architecture using AML
1. Data Estate: This element represents the organizational data estate, potential data sources, and targets for a data science project. Data Engineers would be the primary owners of this element of the MLOps v2 lifecycle. The Azure data platforms in this diagram are neither exhaustive nor prescriptive. However, data sources and targets that represent recommended best practices based on customer use cases will be highlighted and their relationships to other elements in the architecture indicated.
2. Admin/Setup: This element initiates the MLOps v2 Accelerator deployment. It consists of all tasks related to the creation and management of resources and roles associated with the project. These include, but are not limited to:
a. Creation of project source code repositories.
b. Creation of Azure Machine Learning workspaces for the project.
c. Creation/modification of datasets and compute resources used for model experimentation and deployment.
d. Definition of project team users, their roles, and access controls to other resources.
e. Creation of CI/CD (Continuous Integration and Continuous Delivery) pipelines.
f. Creation of Monitors for collection and notification of model and infrastructure metrics.
Personas associated with this phase are primarily the Infrastructure Team, but may also include Data Engineers, Machine Learning Engineers, and Data Scientists.
3. Model Development (Inner Loop): The inner loop element consists of your iterative data science workflow. A typical workflow is illustrated here from data ingestion, EDA (Exploratory Data Analysis), experimentation, model development and evaluation, to the registration of a candidate model for production. This modular element as implemented in the MLOps v2 accelerator is agnostic and adaptable to the process your team may use to develop models.
Personas associated with this phase include Data Scientists and ML Engineers.
4. Deployment (Outer Loop): The outer loop phase consists of pre-production deployment testing, production deployment, and production monitoring, triggered by continuous integration pipelines when the Inner Loop team registers a candidate production model. Continuous Deployment pipelines promote the model and related assets through to production and production monitoring as the tests appropriate to your organization and use case are satisfied. Monitoring in staging/test and production environments enables data collection and action on issues related to model performance, data drift, and infrastructure performance. Such issues may require human-in-the-loop review, automated retraining and reevaluation of the model, or a return to the Development loop or Admin/Setup for new model development or infrastructure modification. The personas associated with this phase are primarily ML Engineers.
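The promotion logic described above — advance a candidate model only when it passes the tests your organization defines — can be sketched as a simple gate function. This is an illustrative example only (the metric names and thresholds are assumptions, not part of the accelerator), showing the kind of check a Continuous Deployment pipeline step might run before deployment:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Evaluation of a model on a held-out dataset (fields are illustrative)."""
    accuracy: float
    data_drift_score: float  # e.g. a PSI- or divergence-style drift statistic

def should_promote(candidate: EvalResult, baseline: EvalResult,
                   max_drift: float = 0.2) -> bool:
    """Promote the candidate only if it matches or beats the production
    baseline and the evaluation data has not drifted past the threshold."""
    return (candidate.accuracy >= baseline.accuracy
            and candidate.data_drift_score <= max_drift)

# A CD pipeline step could call this gate before deploying:
baseline = EvalResult(accuracy=0.91, data_drift_score=0.05)
candidate = EvalResult(accuracy=0.93, data_drift_score=0.08)
print(should_promote(candidate, baseline))  # True
```

In practice the gate can also be replaced by a manual approval step in the CI/CD platform, as discussed later for the deployment pipeline.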
The repository also features architectures specifically designed for Computer Vision (CV) and Natural Language Processing (NLP) use cases. Additional architectures tailored for Azure ML + Spark and IoT (Internet of Things) Edge scenarios are in development.
The MLOps repository contains the best practices we have gathered to let users go from development to production with minimum effort. We have made sure it is not locked into any single technology stack or hard-coded examples, while keeping the examples easy to work with and extensible where needed. The repository supports the following matrix of technologies; users can pick any combination of items in the stack to serve their needs.
The repository, Azure/mlops-v2 on GitHub, is a mono-repo from which users can configure any of the architectural patterns to get started. Users select one item from each column to configure a custom repository for their needs. For example, they can have the infrastructure created with Terraform while using GitHub as their CI/CD platform, and data scientists can pick their ML problem type and how they want to deploy their pipeline in the language of their choice.
Table 1: Architectural Patterns
Items marked in bold are available right now.
These patterns can be deployed for either online or batch scoring. They can be deployed in secure or non-secure configurations, with a single Azure Machine Learning workspace or multiple workspaces. Some customers use multiple workspaces to separate development, testing, and production workloads. For the multi-workspace case, we also show examples of how promotion of the model and/or code can happen as part of the source control workflow or, optionally, be configured as an approval step in the deployment pipeline.
The MLOps v2 accelerator includes additional building blocks that can optionally be configured into the workflow. These include:
1. Responsible AI: Though these components are part of the regular Azure ML workspace, we now include them as a step that can be reviewed by a human. This manual step helps ensure that the developed model adheres to responsible AI principles.
2. Security: We have included steps and best practices from GitHub’s advanced security scanning and credential scanning (also available in Azure DevOps) that can be incorporated into the workflow. For example, there is an example of how to work with conda.yml and requirements.txt to enable security scans on the installed Python packages, something that was not possible with the way the environment and inference containers were previously set up. We have documented how GitHub users can configure the repository to perform package scanning early in the process, preventing security issues from becoming a blocker to deploying the model to production.
Security is a prime concern and to ensure this, we have introduced support for secure workspaces. This will help teams maintain the confidentiality of their projects and data. In addition, we understand the importance of quickly validating concepts. So, we’ve created quick deploy examples with Azure DevOps (ADO), GitHub, and Microsoft Learn to help you get your ideas tested in no time.
Figure 2: An example of GitHub’s Dependabot alerts set up in the repository
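One practical wrinkle behind the conda.yml/requirements.txt approach is that dependency scanners typically read requirements.txt but not the pip section nested inside a conda environment file. A minimal sketch of extracting those pip dependencies so they can be written to a requirements.txt for scanning (assuming the conventional conda.yml layout; this is an illustration, not the accelerator's actual script):

```python
def extract_pip_requirements(conda_yml: str) -> list[str]:
    """Pull the pip dependencies out of a conda environment file so they can
    be written to a requirements.txt that security scanners understand.
    Assumes the conventional layout: a '- pip:' entry under 'dependencies:'
    followed by a more deeply indented list of packages."""
    requirements = []
    in_pip = False
    pip_indent = 0
    for line in conda_yml.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped == "- pip:":
            in_pip = True
            pip_indent = indent
            continue
        if in_pip:
            # Collect '- package==version' entries nested under '- pip:'
            if stripped.startswith("- ") and indent > pip_indent:
                requirements.append(stripped[2:])
            else:
                in_pip = False
    return requirements

conda_yml = """\
name: train-env
dependencies:
  - python=3.9
  - pip
  - pip:
      - scikit-learn==1.1.2
      - mlflow==1.28.0
"""
print(extract_pip_requirements(conda_yml))
# ['scikit-learn==1.1.2', 'mlflow==1.28.0']
```

Writing the extracted list to a requirements.txt in the repository root lets tools such as Dependabot flag vulnerable packages before the environment is built.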
3. Feathr Feature Store Integration: We are excited to announce the integration of Feathr as an enterprise-scale feature store in the MLOps V2 extended architectures. Deployment of Feathr is facilitated using a Terraform script and we provide a simple classical ML example to guide you through its implementation.
4. Multiple Dataset Registration and Third-Party Container Support: In response to user requests, we have added the ability to register multiple datasets in the MLOps v2 accelerator. We believe this gives data scientists and ML engineers more flexibility to work with various datasets in their projects. We have also added support for third-party or external containers, including dependency scans of Python packages installed via pip inside a Docker container.
5. Model Observability: To monitor and identify model and data drift effectively, there needs to be a way to capture and analyze data, especially from the production system. We have implemented Azure Data Explorer (ADX) as the platform to ingest and analyze this data; the typical score.py is modified to push scoring data into ADX.
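The score.py modification amounts to pairing each scoring request with its prediction and shipping that record off for ingestion. A hedged sketch of the payload construction (field names and the model name are illustrative; actual ingestion into ADX would go through a Kusto ingest client rather than the stub shown here):

```python
import json
import uuid
from datetime import datetime, timezone

def build_inference_record(inputs: list, predictions: list,
                           model_name: str) -> str:
    """Build one JSON log line pairing scoring inputs with predictions.
    In the observability pattern, a modified score.py pushes records like
    this into Azure Data Explorer; here we only sketch the payload."""
    record = {
        "request_id": str(uuid.uuid4()),          # correlate request and response
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "inputs": inputs,
        "predictions": predictions,
    }
    return json.dumps(record)

# A modified run() in score.py might log each scoring call like this
# ("taxi-fare-model" is a hypothetical model name):
line = build_inference_record([[5.1, 3.5]], [0], "taxi-fare-model")
print(json.loads(line)["predictions"])  # [0]
```

Because every record carries a timestamp and the raw inputs, downstream ADX queries can compare production input distributions against the training data to surface drift.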
6. Model and Data Drift: We provide two patterns for model and data drift detection: one based on ADX and one that performs the calculations as part of Azure ML pipelines. The choice of implementation depends on the customer’s preference. Both approaches include various statistical tests to estimate the drift between the target and predictions. We plan to expand these with correlation analysis and further statistical tests, such as the Jensen-Shannon divergence, for larger datasets with few unique target values.
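As a concrete illustration of the kind of statistic involved, the Jensen-Shannon divergence mentioned above compares two discrete distributions, for example binned prediction histograms from training versus production. A minimal stdlib implementation (a sketch for intuition, not the accelerator's code):

```python
from math import log2

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence between two discrete distributions
    (e.g. binned prediction histograms from training vs. production).
    With log base 2, the result lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions show no drift; disjoint ones show maximal drift.
print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Unlike KL divergence alone, the Jensen-Shannon form is symmetric and always finite, which makes it convenient as a bounded drift score to threshold on.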
7. Containers: The included patterns also demonstrate how users can bring containers built outside of Azure Container Registry (ACR) into the Azure ML workspace for training and inference. We are in discussions with the Product and Engineering teams to add DeepSpeed- and ONNX-based examples that help end users train and run inference on their models faster.
The repository contains examples of how to work effectively with MLflow models, which allow seamless, code-free API deployment not only for tabular models but also for deep learning models such as those used in computer vision.
The repository is structured so that a customer organization can drive consistency and reuse of the patterns across the organization. To do this, we created a separate repository that holds the individual steps, for example, installing the Azure CLI or deploying the model. These can then be assembled in the main project repository’s pipeline as building blocks and adapted as needed. This also keeps the pipeline readable while ensuring that the steps follow patterns that are common (and approved) across the organization.
Example: The two code samples demonstrate how a template can be reused in the pipeline repository.
With support from the Microsoft Learn team, we offer self-paced, hands-on online content for getting familiar with the various technologies involved. We have also developed OpenHack content, a pilot of which was delivered internally in June 2022; we are currently incorporating feedback from that session and improving the content, which is planned for release early next year.
The MLOps v2 accelerator is the de facto MLOps solution from Microsoft going forward. As it continues to evolve, it will remain a one-stop shop for customers getting started with Azure. It provides the flexibility, scalability, modularity, ease of use, and security to go from development to production quickly, making it a must-have for any AI/ML project on Azure. MLOps v2 not only unifies MLOps at Microsoft but also sets new standards for AI workloads running on Azure.
The accelerator is the product of a worldwide team effort, with contributors from many countries and functions contributing to the repository in many ways.
About the Authors
Scott Donohoo: Technical leader in Data Science and MLOps focused on helping data science organizations realize business value from operationalizing their machine learning models. Over twenty years of experience in on-premises and cloud IT, big data, machine learning and analytics, and enterprise software engineering.
Setu Chokshi: I’m a senior technical leader, innovator, and specialist in machine learning and artificial intelligence. I’m also a leader who has gained the respect of my team through active listening and delegating tasks aligned to talents. My background has occurred organically as technical triumphs have led to greater opportunities. I’ve been fortunate to have worked with industry behemoths—General Electric and Nielsen.