This is the second part of my blog posts on machine learning monitoring. In the first part, we listed the four questions we are trying to address in a machine learning model monitoring setup. We discussed the first two on how to detect functionality degradation of a model in production, as well as how to detect applications of the model on non-optimal populations or even completely inapplicable zones. In this post, we will look at how to detect a change of learned relationships between the input variables and target for a supervised learning model, as well as how to discover new relationships to continuously improve the model.
3. Has the learned relationship changed?
Supervised machine learning models are trained to learn the statistical correlations between the input variables and the target. Taking our credit risk model again as an example, we are learning the correlations between the information we had at the time of loan approval and whether the borrower repaid in full at the end of the term. These correlations are not guaranteed to be static over time, and the model performance metrics can degrade if the learned relationship changes.
At a high level, we can periodically monitor performance metrics such as ROC AUC or accuracy on the observed outcomes of the application data. As discussed in the first blog post, the metrics can change as the input variables change their distributions, even if the learned relationships are static. Therefore, we compare the observed metrics against the kNN reweighted metrics from the benchmark sample (typically the training dataset). If the observed value is worse than the reweighted expectation, it is an indication that the relationship has changed.
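As a rough illustration of that comparison, here is a minimal sketch. It assumes a simple form of kNN reweighting, where each benchmark row is weighted by how often it appears among the k nearest neighbours of an application row. The function names, the brute-force distance search, the choice of accuracy as the metric and the toy numbers are all assumptions for illustration, not the exact procedure from the first post.

```python
import numpy as np

def knn_reweight(benchmark_X, application_X, k=5):
    """Weight each benchmark row by how often it is one of the k nearest
    neighbours (Euclidean) of an application row. Weights sum to 1."""
    # Pairwise squared distances, shape (n_application, n_benchmark)
    diffs = application_X[:, None, :] - benchmark_X[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    # Indices of the k closest benchmark rows for every application row
    nearest = np.argsort(dist2, axis=1)[:, :k]
    counts = np.bincount(nearest.ravel(), minlength=len(benchmark_X))
    return counts / counts.sum()

def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each row contributes in proportion to its weight."""
    return float(np.sum(weights * (y_true == y_pred)) / np.sum(weights))

# Toy benchmark sample: the model is right on 3 of its 4 rows
benchmark_X = np.array([[0.0], [1.0], [10.0], [11.0]])
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0])

# Application data sits entirely in the region around x = 10..11
application_X = np.array([[10.2], [10.8]])

weights = knn_reweight(benchmark_X, application_X, k=1)
benchmark_accuracy = weighted_accuracy(y_true, y_pred, np.ones(4))
expected_accuracy = weighted_accuracy(y_true, y_pred, weights)

print(benchmark_accuracy)  # 0.75 on the unweighted benchmark
print(expected_accuracy)   # 0.5 once reweighted to the application population
```

In this toy case, an observed accuracy of around 0.5 on the application data would be explained by the population shift alone. Only an observed value clearly below the reweighted expectation would point to a change in the learned relationship.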
To further monitor the relationship of individual input variables with the target, we also regularly carry out a residual analysis. The residual is defined as the deviation between the prediction and the outcome. On the training dataset, we would expect the average residuals to be close to zero across the range of every input variable. On the application data, we might detect a non-zero overall residual, which indicates a change of relationship. We can also identify the input variables whose relationship with the outcome has changed, if the residuals exhibit correlations with their values.
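A residual analysis of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: it takes the residual as outcome minus prediction (the sign convention is our choice), splits one input variable into quantile bins, and reports the mean residual per bin. The function name, the number of bins and the synthetic data are all hypothetical.

```python
import numpy as np

def residual_profile(x, y_true, y_pred, n_bins=5):
    """Mean residual (outcome minus prediction) within quantile bins of x."""
    residual = y_true - y_pred
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # Use interior edges only, so bin indices run from 0 to n_bins - 1
    bin_idx = np.digitize(x, edges[1:-1])
    mean_residual = np.array(
        [residual[bin_idx == b].mean() for b in range(n_bins)]
    )
    return edges, mean_residual

# Synthetic application data: the model predicts a flat 20% default rate,
# but the true rate has drifted to 50% for the highest values of x.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20_000)
y_pred = np.full_like(x, 0.2)
true_rate = np.where(x > 0.8, 0.5, 0.2)
y_true = (rng.uniform(size=x.size) < true_rate).astype(float)

edges, mean_residual = residual_profile(x, y_true, y_pred)
for lo, hi, m in zip(edges[:-1], edges[1:], mean_residual):
    print(f"x in [{lo:.2f}, {hi:.2f}]: mean residual {m:+.3f}")
```

The lower bins should show residuals near zero, while the top bin shows a clearly positive residual, flagging x as a variable whose relationship with the outcome has changed. With heavily tied values some quantile bins can be empty, which this sketch does not handle.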
To mitigate the model degradation, we should try to find out the cause of the relationship change. Sometimes it is due to a change in the meaning of an input variable, e.g. reduced spending during the pandemic. Such a change would typically also trigger the distributional change monitoring described in the previous blog post. Other times, it is merely a quantitative drift of the relationship between the inputs and the outcome. For example, as lenders scale back in anticipation of a recession, borrowers relying heavily on credit (a typical input variable) might be more likely to have cash flow problems (and hence fail to repay).
If we believe that the statistical correlation will be stable going forward, we can re-train the model completely with data after the change. Alternatively, we can derive an adjustment function of the affected variable based on the residuals. While the former is straightforward in most cases, the latter solution requires less data and minimizes the statistical variance from the existing model.
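To make the second option more concrete, here is one possible shape for such an adjustment function. This is only a sketch under our own assumptions: the adjustment is a piecewise-constant offset, equal to the mean residual (outcome minus prediction) in each bin of the affected variable, added to the existing prediction and clipped to a valid probability. The function names, the bin edge and the toy numbers are hypothetical, and in practice the offsets would be fitted on recent data and then applied to new applications.

```python
import numpy as np

def fit_bin_adjustment(x, y_true, y_pred, edges):
    """Mean of (outcome - prediction) per bin of x.
    `edges` are interior cut points, giving len(edges) + 1 bins."""
    idx = np.digitize(x, edges)
    residual = y_true - y_pred
    return np.array([
        residual[idx == b].mean() if np.any(idx == b) else 0.0
        for b in range(len(edges) + 1)
    ])

def apply_adjustment(x, y_pred, edges, offsets):
    """Add the per-bin offset to the existing prediction, clipped to [0, 1]."""
    return np.clip(y_pred + offsets[np.digitize(x, edges)], 0.0, 1.0)

# Toy data: the existing model under-predicts for x above 0.5
x = np.array([0.1, 0.2, 0.8, 0.9])
y_true = np.array([0.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.1, 0.1, 0.5, 0.5])
edges = np.array([0.5])

offsets = fit_bin_adjustment(x, y_true, y_pred, edges)
adjusted = apply_adjustment(x, y_pred, edges, offsets)
print(offsets)   # offset per bin of x
print(adjusted)  # existing predictions shifted by the offsets
```

The existing model is left untouched, and only a handful of offsets are estimated, which is why this route needs far less data than a full re-train.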
If, instead, we think the variable is going to be too volatile in the future, the easiest solution might be a refit with this variable excluded. This would lead to a small loss of model performance, but it can be achieved with just the original training data.
4. Is there any new signal emerging?
New causative factor
The change of statistical correlation between the model inputs and the outcome might also be spurious if it’s just the confounding effect of a new causative factor not accounted for in the model. After all, machine learning models are just learning the statistical correlations, while the input variables are often mere proxies of the underlying causes of the outcome. Instead of looking for the cause of model degradation among the input variables, sometimes we might find a more likely explanation among variables not yet included in the model.
That’s why we also run the residual analysis over our entire feature store, which contains thousands of variables as potential input to the model. If any of the variables has extra explanatory power, the residuals would exhibit a correlation with its value.
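A simple version of this screening can be sketched as follows. It assumes the candidate variables fit in memory as a dictionary of arrays and uses the absolute Pearson correlation with the residuals as the score, which only captures linear effects. The function and the variable names ("driver", "proxy", "unrelated") are hypothetical.

```python
import numpy as np

def rank_by_residual_correlation(candidates, residual):
    """Rank candidate variables by |Pearson correlation| with the residuals.
    `candidates` maps a variable name to a 1-D array of its values."""
    scores = {}
    for name, values in candidates.items():
        if np.std(values) == 0:
            continue  # constant columns carry no signal
        scores[name] = abs(np.corrcoef(values, residual)[0, 1])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Synthetic feature store with one true driver, one noisy proxy of it,
# and one unrelated variable.
rng = np.random.default_rng(1)
n = 5_000
driver = rng.normal(size=n)
candidates = {
    "driver": driver,
    "proxy": driver + rng.normal(size=n),
    "unrelated": rng.normal(size=n),
}
residual = 0.5 * driver + rng.normal(scale=0.5, size=n)

ranked = rank_by_residual_correlation(candidates, residual)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Note that the proxy also scores well here even though it has no effect of its own.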
The complication here is that there is a lot of covariance among this large pool of variables (including existing model inputs). It’s likely that we will find a not-so-small collection of variables correlated with the residuals. Few of them, if any at all, would be the causative factor. It’s entirely possible that the causative factor is not observable to us, and we just have to find the best proxy for our statistical modeling.
Therefore, we’d typically go through an iterative feature selection among those variables detected by the residual analysis. In each round, we would adjust the model by adding the variable with the most explanatory power for the residuals, before running the residual analysis again. Typically, we would only need very few extra variables to eliminate the residuals over every dimension.
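The loop can be sketched roughly as below. As a stand-in for "adjusting the model", this sketch simply fits a straight line of the residuals on the chosen variable and subtracts it. That is our simplification, as are the stopping threshold `min_abs_corr`, the function name and the synthetic variables.

```python
import numpy as np

def forward_select(candidates, residual, max_rounds=5, min_abs_corr=0.1):
    """Greedy selection: each round, pick the variable most correlated with
    the current residuals, remove its linear effect, and repeat."""
    residual = np.asarray(residual, dtype=float).copy()
    selected = []
    for _ in range(max_rounds):
        best_name, best_corr = None, min_abs_corr
        for name, values in candidates.items():
            if name in selected or np.std(values) == 0:
                continue
            corr = abs(np.corrcoef(values, residual)[0, 1])
            if corr > best_corr:
                best_name, best_corr = name, corr
        if best_name is None:
            break  # nothing left that explains the residuals
        # Stand-in for the model adjustment: a linear fit on the residuals
        slope, intercept = np.polyfit(candidates[best_name], residual, 1)
        residual = residual - (slope * candidates[best_name] + intercept)
        selected.append(best_name)
    return selected, residual

# Same kind of synthetic pool: a driver, a noisy proxy, an unrelated variable
rng = np.random.default_rng(2)
n = 5_000
driver = rng.normal(size=n)
candidates = {
    "driver": driver,
    "proxy": driver + rng.normal(size=n),
    "unrelated": rng.normal(size=n),
}
residual = 0.5 * driver + rng.normal(scale=0.5, size=n)

selected, remaining = forward_select(candidates, residual)
print(selected)
```

In this synthetic setup the proxy looks informative in the first round, but once the driver has been accounted for it no longer correlates with the remaining residuals, so only one variable is added.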
Occasionally, the change of relationship might also be the result of changes in the underlying mechanism of the outcome, e.g. when payments did not happen due to the payment holidays lenders were obliged to offer during the pandemic. In other words, some borrowers stopped paying not due to a lack of capability, but just as a precaution against recession. Such a change would usually also manifest as a variation in the outcome distribution, which we also monitor closely.
In this scenario, we should review the definition of the outcome according to the business context. We might assume that all borrowers taking payment holidays are going to repay after the crisis and keep using the existing model to predict long-term insolvency. Alternatively, we might want to predict the take-up of payment holidays as another type of hazard, either by extending our model into a multi-class or multi-target one, or by training a separate model for the new outcome.
In these two blog posts, we have looked at the four types of questions that need to be addressed in machine learning model monitoring. Green lights on all of these questions would give us great assurance of business continuity. As the world is changing at a rapid pace, it is pragmatic for data scientists to anticipate some deterioration and be prepared to find mitigation quickly. I hope this article is useful for you, and I look forward to further discussions on this topic at the upcoming ODSC Europe in September.
Editor’s note: Dr. Jiahang Zhong is a speaker for ODSC Europe 2020. Check out his talk, “Can Your Model Survive the Crisis: Monitoring, Diagnosis and Mitigation,” there! In his session, he will share some experience of model monitoring and diagnosis from a leading UK fintech company.
About the author/ODSC Europe speaker:
Dr. Jiahang Zhong is the leader of the data science team at Zopa, one of the UK’s earliest fintech companies. He has broad experience in data science projects in credit risk, operational optimization, and marketing, with keen interests in machine learning, optimization algorithms, and big data technologies. Prior to Zopa, he worked as a PhD and Postdoctoral researcher on the Large Hadron Collider project at CERN, with a focus on data analysis, statistics, and distributed computing.