

Evaluate ML Models with Azure Machine Learning’s Responsible AI Insights
Posted by Minsoo Thigpen, November 14, 2022

In December 2021, we introduced the Responsible AI dashboard, a comprehensive experience that brings together several mature Responsible AI tools: a data explorer (to proactively identify whether there is sufficient data representation for the variety of data subgroups), fairness assessment (to assess and identify your model’s group fairness issues), model interpretability (to understand how features impact your model’s predictions), error analysis (to easily identify error distributions across your data cohorts), and counterfactual and causal inference analysis (to empower you to make responsible model-driven and data-driven decisions). The dashboard aims to address the issues of Responsible AI tool discoverability and fragmentation by enabling:
- Model Debugging: Evaluate machine learning models by identifying model errors, diagnosing why those errors are happening, and mitigating them.
- Responsible Business Decision Making: Boost your data-driven decision-making abilities by addressing questions such as “what is the minimum change the end user could apply to their features to get a different outcome from the model?” or “what is the causal effect of reducing red meat consumption on diabetes progression?”
The Responsible AI dashboard is now integrated and generally available in the Azure Machine Learning platform, enabling our cloud customers to use a variety of experiences (via CLI, SDK, and a no-code UI wizard) to generate Responsible AI dashboards for their machine learning models, enhancing their model debugging and understanding processes.
The Responsible AI scorecard, now in public preview, is a reporting feature that can also be generated in Azure Machine Learning to create and share reports surfacing key data characteristics and model performance and fairness insights. The scorecard helps contextualize model and data health insights for both technical and non-technical audiences, bringing stakeholders along and assisting in compliance reviews.
Walkthrough of the Responsible AI dashboard
In this article, we will walk through a scenario in which a linear regression model is used for the hypothetical purpose of determining developer access to a GPT-2 model published for a limited group of users. The regression model is trained on a historical dataset of programmers who were scored from 0 to 10 based on characteristics such as age, geographical region, operating system, employer, coding style, and so on. If the model predicts a score of 7 to 10, the programmer is allowed access. In the following sections, we will dive deeper into how the Responsible AI dashboard can be used to debug the data and model and inform better decision making. A sample of the synthetic data is below:
| First name | Last name | Score (target) | Style | YOE | IDE | Programming language | Location | Number of GitHub repos contributed to | Employer | OS | Job title | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bryan | Ray | 8 | spaces | 16 | Emacs | R | Antarctica | 2 | Snapchat | MacOS | Principal Engineer | 32 |
| Donovan | Lucero | 3 | tabs | 9 | pyCharm | Swift | Antarctica | 2 | | Linux | Distinguished Engineer | 35 |
| Dean | Hurley | 1 | tabs | 7 | XCode | C# | Antarctica | 0 | Uber | MacOS | Senior Engineer | 32 |
| Nathan | Weaver | 6 | spaces | 15 | Visual Studio | R | Antarctica | 0 | Amazon | Linux | Principal Engineer | 32 |
| Raelyn | Sloan | 5 | tabs | 7 | Eclipse | Java | Antarctica | 0 | | Windows | SWE 2 | 33.1 |
Essentially, this model is allocating opportunity across different developers. So, we should take a closer look at this model to identify what kind of errors it’s making, diagnose what is causing those errors, and use those insights to improve the model. After uncovering those evaluation insights on our model, we can share them via the Responsible AI scorecard with other stakeholders who also want to ensure the app’s transparency and robustness and build trust with our end users.
The Responsible AI dashboard can be generated via a code-first CLI v2 and SDK v2 experience or a no-code method via Azure Machine Learning’s studio UI.
Generating a Responsible AI dashboard
Using Python with the Azure Machine Learning SDKv2
An Azure Machine Learning training pipeline job can be configured and executed remotely from a Python notebook using the Azure Machine Learning SDKv2. Once you have trained and registered your model, you can create a Responsible AI dashboard by selecting the components you would like to activate in the dashboard, specifying the inputs and outputs of each component, and creating a component job for each of them. The components available by default in all Azure Machine Learning workspaces are:
- An initial constructor: this holds all the other components, such as explanations, error analysis, etc.
- An explanation component: this also provides the data explorer and model overview in the Responsible AI dashboard.
- A causal analysis component: we’re interested in using the historical data to uncover the causal effect of the number of GitHub repos programmers contributed to, and of their years of experience, on their score.
- A counterfactual analysis component: we want to generate 10 counterfactual examples per datapoint that would lead the predicted value into the desired score range of 7 to 10.
- An error analysis component: we can optionally specify two features (here, style and employer) for which to pre-generate a heat map of the error distribution.
- Finally, a gather component: this assembles all our Responsible AI insights into the dashboard.
```python
def rai_programmer_regression_pipeline(
    target_column_name,
    train_data,
    test_data,
    score_card_config_path,
):
    # Initiate the RAIInsights
    create_rai_job = rai_constructor_component(
        title="RAI Dashboard Example",
        task_type="regression",
        model_info=expected_model_id,
        model_input=Input(type=AssetTypes.MLFLOW_MODEL, path=azureml_model_id),
        train_dataset=train_data,
        test_dataset=test_data,
        target_column_name=target_column_name,
        categorical_column_names=categorical_columns,
    )
    create_rai_job.set_limits(timeout=120)

    # Add an explanation
    explain_job = rai_explanation_component(
        comment="Explanation for the programmers dataset",
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
    )
    explain_job.set_limits(timeout=120)

    # Add causal analysis
    causal_job = rai_causal_component(
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
        treatment_features=treatment_features,
    )
    causal_job.set_limits(timeout=180)

    # Add counterfactual analysis
    counterfactual_job = rai_counterfactual_component(
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
        total_cfs=10,
        desired_range=desired_range,
    )
    counterfactual_job.set_limits(timeout=600)

    # Add error analysis
    erroranalysis_job = rai_erroranalysis_component(
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
        filter_features=filter_columns,
    )
    erroranalysis_job.set_limits(timeout=120)

    # Combine everything
    rai_gather_job = rai_gather_component(
        constructor=create_rai_job.outputs.rai_insights_dashboard,
        insight_1=explain_job.outputs.explanation,
        insight_2=causal_job.outputs.causal,
        insight_3=counterfactual_job.outputs.counterfactual,
        insight_4=erroranalysis_job.outputs.error_analysis,
    )
    rai_gather_job.set_limits(timeout=120)

    rai_gather_job.outputs.dashboard.mode = "upload"
    rai_gather_job.outputs.ux_json.mode = "upload"

    return {
        "dashboard": rai_gather_job.outputs.dashboard,
        "ux_json": rai_gather_job.outputs.ux_json,
    }
```
With our components defined, we can assemble our pipeline job and submit it to Azure Machine Learning. Model performance and fairness disparity metrics, along with the data explorer, are automatically generated for your Responsible AI dashboard.
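For reference, a minimal sketch of assembling and submitting this pipeline with the Azure Machine Learning SDKv2 (azure-ai-ml) might look like the following. The workspace details, data asset paths, experiment name, and target column name are placeholders, and the sketch assumes the `rai_programmer_regression_pipeline` function above is decorated with `@dsl.pipeline`.

```python
# A minimal sketch, assuming the pipeline function above is decorated with
# @dsl.pipeline and that the placeholder workspace/asset names are replaced.
from azure.ai.ml import MLClient, Input
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to the Azure Machine Learning workspace.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace_name>",
)

# Instantiate the pipeline defined above with our datasets and target column.
pipeline_job = rai_programmer_regression_pipeline(
    target_column_name="Score",  # target column of the programmers dataset
    train_data=Input(type=AssetTypes.MLTABLE, path="<train_data_asset>"),
    test_data=Input(type=AssetTypes.MLTABLE, path="<test_data_asset>"),
    score_card_config_path=Input(type=AssetTypes.URI_FILE, path="<scorecard_config.json>"),
)

# Submit the job and stream its logs; the dashboard shows up under the
# registered model's "Responsible AI" tab once the run completes.
submitted_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="rai_programmers_example"
)
ml_client.jobs.stream(submitted_job.name)
```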
Using YAML with the Azure Machine Learning CLIv2
Alternatively, we can create this job with a YAML file to automate creating the Responsible AI dashboard in your MLOps workflows via the Azure Machine Learning CLIv2 experience. We can specify all the jobs we want to kick off (training the model, registering the model, and then creating the Responsible AI dashboard) in a YAML file, and then execute the job with a single command from the CLI.
```yaml
jobs:
  create_rai_job:
    type: command
    inputs:
      model_input:
        type: mlflow_model
        path: azureml:<model_name>:<model_version>
      title: RAI Dashboard Example
      task_type: regression
      model_info: <model_name>:<model_version>
      categorical_column_names: '["location", "style", "job title", "OS", "Employer", "IDE", "Programming language"]'
      train_dataset:
        path: ${{parent.inputs.train_data}}
      test_dataset:
        path: ${{parent.inputs.test_data}}
      target_column_name:
        path: ${{parent.inputs.target_column_name}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_insight_constructor/versions/<version>
  explain_job:
    type: command
    inputs:
      comment: Explanation for the programmers dataset
      rai_insights_dashboard:
        path: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_explanation/versions/0.1.0
  causal_job:
    type: command
    inputs:
      treatment_features: '["Number of github repos contributed to", "YOE"]'
      rai_insights_dashboard:
        path: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_causal/versions/0.1.0
  counterfactual_job:
    type: command
    inputs:
      total_CFs: '10'
      desired_range: '[5, 10]'
      rai_insights_dashboard:
        path: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_counterfactual/versions/0.1.0
  erroranalysis_job:
    type: command
    inputs:
      filter_features: '["style", "Employer"]'
      rai_insights_dashboard:
        path: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_erroranalysis/versions/0.1.0
  rai_gather_job:
    type: command
    inputs:
      constructor:
        path: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
      insight_1:
        path: ${{parent.jobs.explain_job.outputs.explanation}}
      insight_2:
        path: ${{parent.jobs.causal_job.outputs.causal}}
      insight_3:
        path: ${{parent.jobs.counterfactual_job.outputs.counterfactual}}
      insight_4:
        path: ${{parent.jobs.erroranalysis_job.outputs.error_analysis}}
    outputs:
      dashboard:
        mode: upload
        type: uri_folder
        path: ${{parent.outputs.dashboard}}
      ux_json:
        mode: upload
        type: uri_folder
        path: ${{parent.outputs.ux_json}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_insight_gather/versions/0.1.0
```
Read more about how to create the Responsible AI dashboard with Python and YAML in SDKv2/CLIv2.
Using no-code guided UI wizard in Azure Machine Learning studio
Finally, we can create this job without leaving Azure Machine Learning studio at all, using the no-code wizard experience. From the list of registered models, we first select the model we want to generate Responsible AI insights for, click on the “Responsible AI” tab, and click the “Create Responsible AI insights > Create dashboard” button.
You first pick the training and test datasets that were used to train and test your model.
For this scenario, we will be choosing regression to match our model.
For the Responsible AI dashboard components that we’re interested in, we can choose either the debugging profile or the real-life interventions profile.
We’ll move forward with model debugging and customize the dashboard to include error analysis, counterfactual analysis, and model explanation. For error analysis, we can choose up to two features to pre-generate an error heat map for. For counterfactual analysis, we’re interested in seeing a diverse set of examples (say, 10 per datapoint) in which features are automatically perturbed just enough for the datapoint to receive a score of 7 to 10. We can even control which features are perturbed if we don’t want certain features to be changed.
Once that all looks good, we can move on to the final step to configure our experiment. We can name our job that will generate our Responsible AI dashboard, and either select an existing experiment to kick off the job in or create a new one. We’ll create a new one with the necessary resources and hit ‘Create’ and kick off the job.
With that, we can jump into Azure Machine Learning studio to check whether the job has completed successfully and see the resulting Responsible AI dashboard for our model.
Read more about how to create the Responsible AI dashboard with no-code UI wizard in Azure Machine Learning studio.
Viewing the Responsible AI dashboard
The Responsible AI dashboard is a dynamic and interactive interface for investigating your model and data, built on a host of state-of-the-art open-source technology. You can view your dashboard(s) by navigating to the registered model you generated a Responsible AI dashboard for; clicking on the Responsible AI tab will take you to your dashboards.
The dashboard integrates with your workspace compute resources so you can access all of its features, such as retraining error trees, recalculating probabilities, and generating insights in real time.
The different components of the Responsible AI dashboard are designed such that they can easily communicate with each other. You can create cohorts of your data to slice and dice your analysis and interactively pass cohorts and insights from one component to another for deep-dive investigations. You can hide the different components you’ve generated for the dashboard in the “dashboard configuration” or add them back by clicking the blue “plus” icon.
We first look at our error tree, which shows us where most of our errors are concentrated. It seems that our model made the greatest number of errors for programmers living in Antarctica who don’t program in C, PHP, or Swift and don’t contribute that often to GitHub repos. We can easily save this as a new cohort to investigate later; in the meantime, it will show up as a “Temporary cohort” in the subsequent components.
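If you want to inspect the same slice outside the UI, the cohort identified by the error tree can be reproduced with a simple filter. In the sketch below, the column names follow the sample table above, while the scored dataframe, its `prediction` column, and the repo-count cutoff are assumptions for illustration.

```python
# A minimal pandas sketch (not dashboard code) of the error cohort surfaced
# by the error tree; `scored_df`, its `prediction` column, and the repo
# threshold are assumptions for illustration.
import pandas as pd

error_cohort = scored_df[
    (scored_df["Location"] == "Antarctica")
    & (~scored_df["Programming language"].isin(["C", "PHP", "Swift"]))
    & (scored_df["Number of GitHub repos contributed to"] <= 2)  # assumed cutoff for "infrequent"
]

# Compare the cohort's error against the overall error.
print((scored_df["Score"] - scored_df["prediction"]).abs().mean())        # overall MAE
print((error_cohort["Score"] - error_cohort["prediction"]).abs().mean())  # cohort MAE
```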
When looking at our model overview, we get a high-level view of the model’s prediction distribution to help build intuition for the next steps in model debugging. In the “Feature cohorts” tab, we can also see fairness metrics in the second table; its two rows display the difference and the ratio of the performance metrics shown in the columns of the first table. For example, we see that there is a huge disparity between those who use spaces versus tabs, with a difference in mean absolute error of 659.563.
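To make the difference and ratio rows concrete, here is a small sketch (not the dashboard’s internal code) of how per-cohort mean absolute error and its disparity across a categorical feature could be computed offline; the `prediction` column and the helper name are illustrative.

```python
# Illustrative only: per-cohort MAE plus the difference and ratio across
# cohorts of a categorical feature, mirroring the "Feature cohorts" table.
import pandas as pd
from sklearn.metrics import mean_absolute_error

def mae_disparity(df: pd.DataFrame, feature: str,
                  y_true: str = "Score", y_pred: str = "prediction") -> dict:
    # MAE for each cohort defined by the chosen feature.
    per_cohort = df.groupby(feature).apply(
        lambda g: mean_absolute_error(g[y_true], g[y_pred])
    )
    return {
        "per_cohort_mae": per_cohort.to_dict(),
        "difference": per_cohort.max() - per_cohort.min(),
        "ratio": per_cohort.max() / per_cohort.min(),
    }

# Example: disparity in MAE between programmers who use tabs vs. spaces.
# mae_disparity(scored_df, "Style")
```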
We can use the data explorer to see if the feature distribution in our dataset is skewed. This can cause a model to incorrectly predict datapoints belonging to an underrepresented group or to be optimized along an inappropriate metric. If we bin our x-axis by the ground-truth scores a programmer can get (where 7 to 10 is the accepted range) and look at style, we see a highly skewed distribution in which programmers who use tabs are scored lower and programmers who use spaces are scored higher.
Additionally, since we know our model made the most errors for those living in Antarctica, when we investigate location, we see a highly skewed distribution in which programmers living in Antarctica were scored lower. This means that our model will unfairly favor those who use spaces and do not live in Antarctica when granting access to the application we built.
Moving down to aggregate feature importance, we can see which features were most important to the overall model’s predictions: style (tabs or spaces) is by far the most influential, followed by operating system and programming language. If we click into style, we can see that ‘spaces’ has a positive feature importance and ‘tabs’ has a negative one, showing us that using ‘spaces’ is what contributes to a higher score.
We can also look at two specific programmers who got a low and a high score. Row 35 has a high score and uses spaces, and Row 2 has a low score and uses tabs. When we look at the individual feature importance of each programmer’s features, we can see that ‘spaces’ contributed positively to Row 35’s high score, while ‘tabs’ contributed negatively toward Row 2’s lower score.
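The dashboard’s explanation component is based on the open-source InterpretML ecosystem (interpret-community). As a rough offline substitute, aggregate and per-row importances can also be approximated with SHAP, as in the hedged sketch below; `model`, `X_background`, and `X_test` are assumed to exist and this is not the dashboard’s implementation.

```python
# Not the dashboard's implementation; a rough SHAP-based sketch for
# reproducing aggregate and individual importances offline. `model`,
# `X_background`, and `X_test` (a preprocessed DataFrame) are assumed.
import numpy as np
import shap

explainer = shap.Explainer(model.predict, X_background)  # model-agnostic explainer
shap_values = explainer(X_test)

# Aggregate importance: mean |SHAP value| per feature; in this scenario
# "Style" should dominate, followed by OS and programming language.
aggregate_importance = dict(
    zip(X_test.columns, np.abs(shap_values.values).mean(axis=0))
)

# Individual importance for a single programmer, e.g. the row at index 35.
row_importance = dict(zip(X_test.columns, shap_values.values[35]))
```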
We can take a deeper look with counterfactual what-if examples. When selecting someone whose prediction falls below the 7 to 10 range, we can see the minimum changes to their features that would lead to a much higher prediction. In this programmer’s case, one recommended change would be switching their style to spaces.
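The counterfactual component is built on the open-source DiCE library (dice-ml). A standalone sketch of generating similar what-if examples, with assumed column names and a generic scikit-learn model, might look like this:

```python
# A minimal dice-ml sketch (assumed column names and a scikit-learn model),
# mirroring the dashboard's counterfactual settings: 10 examples per
# datapoint with a desired score range of 7 to 10.
import dice_ml

data = dice_ml.Data(
    dataframe=train_df,  # features plus the target column
    continuous_features=["YOE", "Age", "Number of GitHub repos contributed to"],
    outcome_name="Score",
)
model = dice_ml.Model(model=sklearn_model, backend="sklearn", model_type="regressor")
explainer = dice_ml.Dice(data, model, method="random")

counterfactuals = explainer.generate_counterfactuals(
    query_instances=low_scoring_rows,  # programmers predicted below 7
    total_CFs=10,
    desired_range=[7, 10],
    features_to_vary=["Style", "Number of GitHub repos contributed to", "IDE"],  # e.g. keep Age fixed
)
counterfactuals.visualize_as_dataframe(show_only_changes=True)
```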
Finally, if we want to use purely historical data to identify the features that have the most direct effect on our outcome of interest, in this case the score, we can use causal analysis. Here, we want to understand the causal effect of years of experience and of the number of GitHub repos a programmer has contributed to on the score. The aggregate causal effects show that, across the whole dataset on average, increasing the number of GitHub repos by 1 increases the score by 0.095, whereas increasing years of experience by 1 barely moves the score at all.
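The causal component builds on the open-source EconML package. A standalone sketch of estimating the same kind of average and individualized treatment effects, with assumed variable names and preprocessing, could look like this:

```python
# A minimal EconML sketch (assumed variable names), estimating the effect of
# the number of GitHub repos (treatment T) on the score (outcome Y), with the
# remaining encoded features as effect modifiers X.
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor

Y = df["Score"].values                                   # outcome
T = df["Number of GitHub repos contributed to"].values   # continuous treatment
X = df_encoded.drop(
    columns=["Score", "Number of GitHub repos contributed to"]
).values

est = LinearDML(
    model_y=RandomForestRegressor(random_state=0),
    model_t=RandomForestRegressor(random_state=0),
    random_state=0,
)
est.fit(Y, T, X=X)

print(est.ate(X))         # average effect of one additional repo (cf. ~0.095 above)
print(est.effect(X[:5]))  # individualized effects for the first few programmers
```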
However, if we look at individual programmers, perturb those values, and examine the outcome of specific treatments to years of experience, we can see that for some programmers, increasing years of experience does cause the score to increase a bit.
Additionally, the treatment policy tab can help us decide what overall treatment policy to take to maximize real-world impact on our score. We can see the best future interventions to apply to certain segmentations of our programmer population to see the biggest boost in the scores overall.
And if you can only focus on reaching out to 10 programmers, you can see a ranked list of the top k programmers who would gain the most from either increasing or decreasing the number of GitHub repos.
Read the UI overview of how to use the different charts and visualizations of the Responsible AI dashboard.
Next steps
Learn more about the RAI dashboard and scorecard in the Microsoft documentation and generate them today to boost justified trust and appropriate reliance in your AI-driven processes.
- Learn more about Responsible AI in Azure Machine Learning
- Learn more about how to generate and use the Responsible AI dashboard
- After you’ve generated your Responsible AI dashboard, view how to access and use it in Azure Machine Learning studio.
- Summarize and share your Responsible AI insights with the Responsible AI scorecard as a PDF export
- Learn more about the concepts and techniques behind the Responsible AI dashboard.
- Learn more about how to collect data responsibly.
- View sample YAML and Python notebooks to generate the Responsible AI dashboard with YAML or Python.
- Learn about how the Responsible AI dashboard and scorecard were used by the UK National Health Service (NHS) in a real life customer story.
- Explore the features of the Responsible AI dashboard through this interactive AI lab web demo.
Acknowledgments:
In the past year, our teams across the globe have joined forces to release the very first one-stop-shop dashboard for easy implementation of responsible AI in practice, making these efforts available to the community as open source and as part of the Azure Machine Learning ecosystem. We acknowledge their great efforts and are excited to see how you use this tool in your AI lifecycle.
Azure Machine Learning:
- Responsible AI development team: Steve Sweetman, Lan Tang, Ke Xu, Roman Lutz, Richard Edgar, Ilya Matiach, Gaurav Gupta, Kin Chan, Vinutha Karanth, Tong Yu, Ruby Zhu
- AI marketing team: Thuy Ngyuen, Melinda Hu, Trinh Duong
- Additional thanks to Seth Juarez, Christian Gero, Manasa Ramalinga, Vijay Aski, Anup Shirgaonkar, Benny Eisman for their contributions to this launch!
Microsoft Research:
- Responsible AI dashboard: Besmira Nushi (Research Lead)
- Causal Inference: Vasilis Syrgkanis, Eleanor Dillon, Keith Battocchi, Paul Oka, Emre Kiciman, Friederike Niedtner
- Error Analysis and Interpretability: Ece Kamar, Besmira Nushi, Saleema Amershi, Eric Horvitz, Rich Caruana, Paul Koch, Harsha Nori, Samuel Jenkins, Rahee Gosh Peshawaria
- Counterfactual Analysis: Amit Sharma
- AI Fairness: Hanna Wallach, Miro Dudik
- AI Ethics and Effects in Engineering and Research (Aether): Jingya Chen, Mihaela Vorvoreanu, Dean Carignan