Evaluate ML Models with Azure Machine Learning’s Responsible AI Insights

In December 2021, we introduced the Responsible AI dashboard, a comprehensive experience bringing together several mature Responsible AI tools: data exploration (to proactively identify whether there is sufficient data representation for the variety of data subgroups), fairness assessment (to assess and identify your model’s group fairness issues), model interpretability (to understand how features impact your model’s predictions), error analysis (to easily identify error distributions across your data cohorts), and counterfactual and causal inference analysis (to empower you to make responsible model-driven and data-driven decisions). The dashboard aims to address the issues of Responsible AI tool discoverability and fragmentation by enabling:

  1. Model Debugging: Evaluate machine learning models by identifying model errors, diagnosing why those errors are happening, and mitigating them.
  2. Responsible Business Decision Making: Boost your data-driven decision-making abilities by addressing questions such as “what is the minimum change the end user could apply to their features to get a different outcome from the model?” and “what is the causal effect of reducing red meat consumption on diabetes progression?”


The Responsible AI dashboard is now integrated with, and generally available in, the Azure Machine Learning platform, enabling our cloud customers to use a variety of experiences (via CLI, SDK, and a no-code UI wizard) to generate Responsible AI dashboards for their machine learning models, enhancing their model debugging and understanding processes.

In public preview, the Responsible AI scorecard is a reporting feature that can also be generated in Azure Machine Learning to create and share reports surfacing key data characteristics and model performance and fairness insights. The scorecard helps contextualize model and data health insights for both technical and non-technical audiences, bringing stakeholders along and assisting in compliance reviews.

Walkthrough of the Responsible AI dashboard

In this article, we walk through a scenario in which a linear regression model is used for the hypothetical purpose of determining developer access to a GPT-2 model published for a limited group of users. In the following sections, we dive deeper into how the Responsible AI dashboard can be used to debug the data and model and inform better decision making. The regression model is trained on a historical dataset of programmers who were scored from 0 to 10 based on characteristics such as age, geographical region, operating system, employer, coding style, and so on. If the model predicts a score of 7 to 10, the programmer is allowed access. A sample of the synthetic data is below.

| First name | Last name | Score (target) | Style | YOE | IDE | Programming language | Location | Number of GitHub repos contributed to | Employer | OS | Job title | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bryan | Ray | 8 | spaces | 16 | Emacs | R | Antarctica | 2 | Snapchat | MacOS | Principal Engineer | 32 |
| Donovan | Lucero | 3 | tabs | 9 | pyCharm | Swift | Antarctica | 2 | Instagram | Linux | Distinguished Engineer | 35 |
| Dean | Hurley | 1 | tabs | 7 | XCode | C# | Antarctica | 0 | Uber | MacOS | Senior Engineer | 32 |
| Nathan | Weaver | 6 | spaces | 15 | Visual Studio | R | Antarctica | 0 | Amazon | Linux | Principal Engineer | 32 |
| Raelyn | Sloan | 5 | tabs | 7 | Eclipse | Java | Antarctica | 0 | Twitter | Windows | SWE 2 | 33.1 |

 

Essentially, this model is allocating opportunity across different developers. So, we should take a closer look at this model to identify what kind of errors it’s making, diagnose what is causing those errors, and use those insights to improve the model. After uncovering those evaluation insights on our model, we can share them via the Responsible AI scorecard with other stakeholders who also want to ensure the app’s transparency and robustness and build trust with our end users.

The Responsible AI dashboard can be generated via a code-first CLI v2 and SDK v2 experience or a no-code method via Azure Machine Learning’s studio UI.

Generating a Responsible AI dashboard

Using Python with the Azure Machine Learning SDKv2

An Azure Machine Learning training pipeline job can be configured and executed remotely from a Python notebook using the Azure Machine Learning SDKv2. Once you train and register your model, you can create a Responsible AI dashboard by first selecting the components you would like to activate in the dashboard, specifying the inputs and outputs of each component, and creating a component job for each of them. The components available by default in all Azure Machine Learning workspaces are listed below, followed by a sketch of how to fetch them from the registry:

  • An initial constructor: holds all the other components, such as explanations and error analysis.
  • An explanation: also provides the data exploration and model overview experiences in the Responsible AI dashboard.
  • Causal analysis: we’re interested in using the historical data to uncover the causal effect of the number of GitHub repos programmers contributed to, and of their years of experience, on their score.
  • Counterfactual analysis: we want to generate 10 counterfactual examples per datapoint that lead the predicted value into the desired score range of 7 to 10.
  • Error analysis: we can optionally specify two features (here, style and employer) for which to pre-generate a heat map of error distributions.
  • Finally, a ‘gather’ component: assembles all our Responsible AI insights into the dashboard.
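
These components live in the shared ‘azureml’ registry. A minimal sketch of fetching them with the SDKv2, assuming the subscription and resource group placeholders below and pinning to the latest label:

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Client scoped to the shared "azureml" registry that hosts the RAI components.
ml_client_registry = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription_id>",     # placeholder
    resource_group_name="<resource_group>",  # placeholder
    registry_name="azureml",
)

def get_component(name: str):
    # Resolve the latest published version of a registry component.
    return ml_client_registry.components.get(name=name, label="latest")

rai_constructor_component = get_component("microsoft_azureml_rai_tabular_insight_constructor")
rai_explanation_component = get_component("microsoft_azureml_rai_tabular_explanation")
rai_causal_component = get_component("microsoft_azureml_rai_tabular_causal")
rai_counterfactual_component = get_component("microsoft_azureml_rai_tabular_counterfactual")
rai_erroranalysis_component = get_component("microsoft_azureml_rai_tabular_erroranalysis")
rai_gather_component = get_component("microsoft_azureml_rai_tabular_insight_gather")
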
from azure.ai.ml import dsl, Input
from azure.ai.ml.constants import AssetTypes

# expected_model_id (e.g. "<model_name>:<version>"), azureml_model_id
# (e.g. f"azureml:{expected_model_id}"), compute_name, categorical_columns,
# treatment_features, desired_range, and filter_columns are defined earlier
# in the notebook.
@dsl.pipeline(
    compute=compute_name,
    description="Example RAI computation on programmers data",
)
def rai_programmer_regression_pipeline(
    target_column_name,
    train_data,
    test_data,
    score_card_config_path,
):
    # Initiate the RAIInsights constructor
    create_rai_job = rai_constructor_component(
        title="RAI Dashboard Example",
        task_type="regression",
        model_info=expected_model_id,
        model_input=Input(type=AssetTypes.MLFLOW_MODEL, path=azureml_model_id),
        train_dataset=train_data,
        test_dataset=test_data,
        target_column_name=target_column_name,
        categorical_column_names=categorical_columns,
    )
    create_rai_job.set_limits(timeout=120)

    # Add an explanation
    explain_job = rai_explanation_component(
        comment="Explanation for the programmers dataset",
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
    )
    explain_job.set_limits(timeout=120)

    # Add causal analysis
    causal_job = rai_causal_component(
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
        treatment_features=treatment_features,
    )
    causal_job.set_limits(timeout=180)

    # Add counterfactual analysis
    counterfactual_job = rai_counterfactual_component(
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
        total_cfs=10,
        desired_range=desired_range,
    )
    counterfactual_job.set_limits(timeout=600)

    # Add error analysis
    erroranalysis_job = rai_erroranalysis_component(
        rai_insights_dashboard=create_rai_job.outputs.rai_insights_dashboard,
        filter_features=filter_columns,
    )
    erroranalysis_job.set_limits(timeout=120)

    # Combine everything into the dashboard
    rai_gather_job = rai_gather_component(
        constructor=create_rai_job.outputs.rai_insights_dashboard,
        insight_1=explain_job.outputs.explanation,
        insight_2=causal_job.outputs.causal,
        insight_3=counterfactual_job.outputs.counterfactual,
        insight_4=erroranalysis_job.outputs.error_analysis,
    )
    rai_gather_job.set_limits(timeout=120)

    rai_gather_job.outputs.dashboard.mode = "upload"
    rai_gather_job.outputs.ux_json.mode = "upload"

    return {
        "dashboard": rai_gather_job.outputs.dashboard,
        "ux_json": rai_gather_job.outputs.ux_json,
    }
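
One note on the signature above: score_card_config_path is accepted but not consumed by any of the components shown. In the documented examples, a separate scorecard component (microsoft_azureml_rai_tabular_score_card at the time of writing; treat the exact name and input names as assumptions) consumes the gathered dashboard plus a JSON configuration inside the same pipeline:

# Sketch: fetch rai_score_card_component from the registry like the others,
# then wire it up after rai_gather_job inside the pipeline function.
rai_scorecard_job = rai_score_card_component(
    dashboard=rai_gather_job.outputs.dashboard,    # the gathered RAI insights
    pdf_generation_config=score_card_config_path,  # JSON config driving the PDF report
)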

With our components defined, we can assemble the pipeline job and submit it to Azure Machine Learning. Model performance metrics, fairness disparity metrics, and the data explorer are automatically generated for your Responsible AI dashboard.
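
For example, assembling and submitting the job might look like the sketch below; the data asset names, target column, and config path are hypothetical, and ml_client is assumed to be an MLClient bound to your workspace:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

insights_pipeline_job = rai_programmer_regression_pipeline(
    target_column_name="Score (target)",  # hypothetical column name
    train_data=Input(type=AssetTypes.MLTABLE, path="azureml:programmers_train:1"),  # hypothetical asset
    test_data=Input(type=AssetTypes.MLTABLE, path="azureml:programmers_test:1"),    # hypothetical asset
    score_card_config_path=Input(type=AssetTypes.URI_FILE, path="./rai_scorecard_config.json"),
)
ml_client.jobs.create_or_update(insights_pipeline_job)  # submit the pipeline to Azure Machine Learning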

Using YAML with the Azure Machine Learning CLIv2

Alternatively, we can automate creating the Responsible AI dashboard in your MLOps via the Azure Machine Learning CLIv2 experience. We specify all the jobs we want to kick off (training the model, registering it, and creating the Responsible AI dashboard) in a YAML file, then execute the job with a single line from the CLI, shown after the YAML below.

jobs:
  create_rai_job:
    type: command
    inputs:
      model_input:
        type: mlflow_model
        path: azureml:<model_name>:<model_version>
      title: RAI Dashboard Example
      task_type: regression
      model_info: <model_name>:<model_version>
      categorical_column_names: '["location", "style", "job title", "OS", "Employer",
        "IDE", "Programming language"]'
      train_dataset: ${{parent.inputs.train_data}}
      test_dataset: ${{parent.inputs.test_data}}
      target_column_name: ${{parent.inputs.target_column_name}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_insight_constructor/versions/<version>

  explain_job:
    type: command
    inputs:
      comment: Explanation for the programmers dataset
      rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_explanation/versions/0.1.0

  causal_job:
    type: command
    inputs:
      treatment_features: '["Number of github repos contributed to", "YOE"]'
      rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_causal/versions/0.1.0

  counterfactual_job:
    type: command
    inputs:
      total_CFs: '10'
      desired_range: '[7, 10]'
      rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_counterfactual/versions/0.1.0

  erroranalysis_job:
    type: command
    inputs:
      filter_features: '["style", "Employer"]'
      rai_insights_dashboard: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_erroranalysis/versions/0.1.0

  rai_gather_job:
    type: command
    inputs:
      constructor: ${{parent.jobs.create_rai_job.outputs.rai_insights_dashboard}}
      insight_1: ${{parent.jobs.explain_job.outputs.explanation}}
      insight_2: ${{parent.jobs.causal_job.outputs.causal}}
      insight_3: ${{parent.jobs.counterfactual_job.outputs.counterfactual}}
      insight_4: ${{parent.jobs.erroranalysis_job.outputs.error_analysis}}
    outputs:
      dashboard:
        mode: upload
        type: uri_folder
        path: ${{parent.outputs.dashboard}}
      ux_json:
        mode: upload
        type: uri_folder
        path: ${{parent.outputs.ux_json}}
    component: azureml://registries/azureml/components/microsoft_azureml_rai_tabular_insight_gather/versions/0.1.0
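
Assuming the pipeline above is saved as rai_pipeline.yaml (the file, workspace, and resource group names here are placeholders), that single line is:

az ml job create --file rai_pipeline.yaml --workspace-name my-workspace --resource-group my-rg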

Read more about how to create the Responsible AI dashboard with Python and YAML in SDKv2/CLIv2.

Using no-code guided UI wizard in Azure Machine Learning studio

Finally, we can create this job without leaving Azure Machine Learning studio at all, using a no-code wizard experience. From our list of registered models, we first select the model we want to generate Responsible AI insights for, open the “Responsible AI” tab, and click the “Create Responsible AI insights > Create dashboard” button.


First, pick the train and test datasets that were used to train and test your model.


For this scenario, we will be choosing regression to match our model.


For the Responsible AI dashboard components that we’re interested in, we can choose either the debugging profile or the real-life interventions profile.


We’ll move forward with model debugging and customize the dashboard to include error analysis, counterfactual analysis, and model explanations. For error analysis, we can choose up to two features for which to pre-generate an error heat map. For counterfactual analysis, we’re interested in seeing a diverse set of examples (say, 10 per datapoint) where features are automatically perturbed just enough to receive a score of 7 to 10. We can even control which features are perturbed if we don’t want certain features to be changed.


Once that all looks good, we can move on to the final step and configure our experiment. We can name the job that will generate our Responsible AI dashboard, and either select an existing experiment to kick off the job in or create a new one. We’ll create a new one with the necessary resources and hit ‘Create’ to kick off the job.


With that, we can jump into Azure Machine Learning studio to see whether the job completed successfully and view the resulting Responsible AI dashboard for our model.

Read more about how to create the Responsible AI dashboard with no-code UI wizard in Azure Machine Learning studio.

Viewing the Responsible AI dashboard

The Responsible AI dashboard is a dynamic, interactive interface for investigating your model and data, built on a host of state-of-the-art open-source technology. You can view your dashboard(s) by navigating to the registered model you generated a Responsible AI dashboard for; clicking on the Responsible AI tab will take you to your dashboards.
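
Those building blocks ship in the open-source responsibleai and raiwidgets Python packages, so equivalent insights can also be computed locally. A minimal sketch, assuming a trained model and pandas train_df/test_df with the sample table’s columns (the exact column names are assumptions):

from responsibleai import RAIInsights
from raiwidgets import ResponsibleAIDashboard

# `model`, `train_df`, and `test_df` are assumed to exist already.
rai_insights = RAIInsights(
    model, train_df, test_df,
    target_column="Score (target)",  # assumed column name
    task_type="regression",
    categorical_features=["Style", "IDE", "Programming language", "Location", "Employer", "OS", "Job title"],
)
rai_insights.explainer.add()
rai_insights.error_analysis.add()
rai_insights.counterfactual.add(total_CFs=10, desired_range=[7, 10])
rai_insights.causal.add(treatment_features=["Number of GitHub repos contributed to", "YOE"])
rai_insights.compute()
ResponsibleAIDashboard(rai_insights)  # serves the interactive dashboard locally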


An integration with your workspace compute resources enables the full set of features, such as retraining error trees, recalculating probabilities, and generating insights in real time.


The different components of the Responsible AI dashboard are designed such that they can easily communicate with each other. You can create cohorts of your data to slice and dice your analysis and interactively pass cohorts and insights from one component to another for deep-dive investigations. You can hide the different components you’ve generated for the dashboard in the “dashboard configuration” or add them back by clicking the blue “plus” icon.

We first look at our error tree, which tells us where most of our errors lie. It seems that our model made the greatest number of errors for programmers living in Antarctica who don’t program in C, PHP, or Swift and don’t contribute often to GitHub repos. We can easily save this as a new cohort to investigate later; in the meantime, it will show up as a “Temporary cohort” in the subsequent components.


When looking at our model overview, we get a high-level view of the model’s prediction distribution to help build intuition for the next steps in model debugging. In the “Feature cohorts” tab, we can also see fairness metrics in the second table: its two rows display the difference and the ratio of the performance metrics shown in the columns of the first table. For example, we see a huge disparity between those who use spaces versus tabs, with a difference in mean absolute error of 659.563.
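
To make those disparity metrics concrete, here is a rough sketch (not the dashboard’s own code) of how a per-cohort difference and ratio in mean absolute error can be computed; the DataFrame and column names are assumptions:

import pandas as pd
from sklearn.metrics import mean_absolute_error

def mae_disparity(df: pd.DataFrame, feature: str, y_true: str, y_pred: str):
    # MAE for each cohort (unique value) of `feature`.
    maes = {
        value: mean_absolute_error(group[y_true], group[y_pred])
        for value, group in df.groupby(feature)
    }
    worst, best = max(maes.values()), min(maes.values())
    return maes, worst - best, worst / best  # per-cohort MAEs, difference, ratio

# Hypothetical usage on a test set with model predictions attached:
# mae_disparity(test_df, "Style", y_true="Score (target)", y_pred="prediction")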

We can use the data explorer to see if the feature distribution in our dataset is skewed, which can cause a model to incorrectly predict datapoints belonging to an underrepresented group or to be optimized along an inappropriate metric. If we bin the x-axis by the ground-truth scores a programmer can get (where 7 to 10 is the accepted range) and look at style, we see a highly skewed distribution: programmers who use tabs are scored lower and programmers who use spaces are scored higher.


Additionally, since we know our model made the most errors for those living in Antarctica, when we investigate location, we see a highly skewed distribution of programmers living in Antarctica being scored lower. This means our model will unfairly favor those who use spaces and do not live in Antarctica when granting access to the application we built.
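
The same kind of skew check is easy to reproduce outside the dashboard; a hedged sketch, assuming a pandas DataFrame train_df with the sample table’s columns:

import pandas as pd

# Bin the ground-truth score (7-10 is the accepted range) and cross-tabulate
# against coding style to expose skew in the label distribution per cohort.
train_df["score_bin"] = pd.cut(train_df["Score (target)"], bins=[0, 3, 7, 10], include_lowest=True)
print(pd.crosstab(train_df["Style"], train_df["score_bin"], normalize="index"))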


Moving down to aggregate feature importance, we can see which features were most important to the overall model’s predictions: style (tabs or spaces) is by far the most influential, followed by operating system and programming language. If we click into style, we can see that ‘spaces’ has a positive feature importance and ‘tabs’ a negative one, showing that ‘spaces’ is what contributes to a higher score.


We can also look at two specific programmers who got a low and a high score. Row 35 has a high score and uses spaces; row 2 has a low score and uses tabs. When we look at the individual feature importance of each programmer’s features, we can see that ‘spaces’ positively contributed to Row 35’s high score, while ‘tabs’ contributed negatively to Row 2’s lower score.


We can take a deeper look with counterfactual what-if examples. Selecting someone predicted below the 7 to 10 range, we can see the bare-minimum changes to their features that would lead to a much higher prediction. In this programmer’s case, one recommended change would be switching their style to spaces.


Finally, if we want to use purely historical data to identify the features that have the most direct effect on our outcome of interest (the score), we can use causal analysis. In our case, we want to understand the causal effect of years of experience and of the number of GitHub repos a programmer has contributed to on the score. The aggregate causal effects show that, on average across the whole dataset, increasing the number of GitHub repos by 1 increases the score by 0.095, whereas increasing years of experience by 1 barely increases the score at all.


However, if we look at individual programmers, perturb those values, and observe the outcome of specific treatments to years of experience, we can see that for some programmers, increasing years of experience does cause the score to increase a bit.


Additionally, the treatment policy tab can help us decide what overall treatment policy to take to maximize real-world impact on the score. We can see the best future interventions to apply to certain segments of our programmer population for the biggest overall boost in scores.


And if you can only focus on 10 programmers to reach out to, you can see a ranked list of the top k programmers who would gain the most from either increasing or decreasing their number of GitHub repos.


Read the UI overview of how to use the different charts and visualizations of the Responsible AI dashboard.

Next steps

Learn more about the RAI dashboard and scorecard in the Microsoft documentation and generate them today to boost justified trust and appropriate reliance in your AI-driven processes.

Acknowledgments:

In the past year, our teams across the globe have joined forces to release the very first one-stop-shop dashboard for easy implementation of responsible AI in practice, making these efforts available to the community as open source and as part of the Azure Machine Learning ecosystem. We acknowledge their great efforts and are excited to see how you use this tool in your AI lifecycle.

Azure Machine Learning:

  • Responsible AI development team: Steve Sweetman, Lan Tang, Ke Xu, Roman Lutz, Richard Edgar, Ilya Matiach, Gaurav Gupta, Kin Chan, Vinutha Karanth, Tong Yu, Ruby Zhu
  • AI marketing team: Thuy Nguyen, Melinda Hu, Trinh Duong
  • Additional thanks to Seth Juarez, Christian Gero, Manasa Ramalinga, Vijay Aski, Anup Shirgaonkar, Benny Eisman for their contributions to this launch!

Microsoft Research:

Minsoo Thigpen

About the author: Minsoo Thigpen is a Product Manager at Microsoft Azure Machine Learning designing and building out Responsible AI tools for data scientists. She has bachelor’s degrees in Applied Mathematics and Painting from Brown University and Rhode Island School of Design (RISD). Coming from an interdisciplinary background with experience in building machine learning models and products, analyzing data, and designing UX, she is always finding work at the intersection of AI/ML, design, and social sciences to empower data and ML practitioners to work ethically and responsibly end-to-end.
