At ODSC London 2018, Yuriy Guts of DataRobot gave a talk on data leakage, including potential sources of the problem and how it can be remedied.
Data leakage – also sometimes referred to as data snooping – is a phenomenon in machine learning that occurs when a model is trained on information that will not be available to it at prediction time. The result is a model that will produce optimistic estimates of its performance in the real world, even during testing.
Because models affected by data leakage tend to score well even on held-out data during evaluation, the problem is very difficult to detect. Many practitioners only find out that their model is broken when they attempt to deploy it, which can be a costly mistake in industries such as healthcare and finance.
This makes it all the more important for data scientists to stay apprised of the ways that data leakage can appear in a machine learning workflow. These are some of the key highlights from Mr. Guts’ talk on the topic, which you can watch in full on YouTube.
Leakage in Collection
Leakage can occur anywhere in a machine learning workflow, including at the outset, during data collection. It may sound obvious, but a machine learning model must be trained only on information that will be available at prediction time. A model that depends on a feature that is unavailable at that point will be missing an input in production, which fundamentally breaks it.
Guts gave the example of predicting whether a loan is bad based on the number of late payment reminders a borrower has received. When a bank is first writing a loan, it has no idea whether that person will be late on their payments in the future. It can guess, but that information is simply unavailable when the borrower signs on the dotted line. Building a machine learning model on that information is a recipe for disaster.
Again, this may seem obvious, but it is possible to fall into this trap when working on high-dimensional data with hundreds or thousands of columns. Steer clear of this at all costs.
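One way to guard against this on a wide dataset is to record, for every column, when its value first becomes known, and to keep only the columns that exist at prediction time. The following is a minimal sketch of that idea. The column names, the stage labels, and the helper functions are hypothetical and were not part of the talk.

```python
# Hypothetical metadata: for each column, the stage of the loan lifecycle
# at which its value first becomes known. Names are illustrative only.
AVAILABLE_AT = {
    "income": "application",
    "loan_amount": "application",
    "credit_score": "application",
    "late_payment_reminders": "repayment",  # only known after the loan is issued
}


def usable_features(available_at, prediction_time="application"):
    """Return the columns whose values exist when the prediction is made."""
    return [col for col, stage in available_at.items() if stage == prediction_time]


def leaky_features(available_at, prediction_time="application"):
    """Return the columns that would leak information from the future."""
    return [col for col, stage in available_at.items() if stage != prediction_time]


features = usable_features(AVAILABLE_AT)
leaks = leaky_features(AVAILABLE_AT)
print("train on:", features)
print("exclude:", leaks)
```

Keeping this metadata next to the dataset makes the check repeatable as columns are added, rather than relying on someone spotting a leaky column by eye.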
Leakage and Preprocessing, Feature Engineering
I once heard a professor describe data leakage as a “subtle, happy hell.” This section shows why.
The statistics we use to describe our data must be computed during training and then held fixed for evaluation. This includes assumptions about the data distribution, especially the mean and standard deviation used for normalization.
Most practitioners, myself included, typically drop the full dataset into a single collection and normalize it all at once before splitting it into training and test sets. While the code for this approach is cleaner, it lets information leak from the test set into training. The mean and standard deviation are computed over the full dataset rather than the training data alone, so the test data influences how the training data is transformed.
Some practitioners will normalize the two datasets separately, using different means and standard deviations. This is also incorrect since it breaks the assumption that the data is drawn from the same distribution.
Mr. Guts tells us that in order to remedy this, we must first separate our data into training and test sets. Then, once we have normalized the training set, we reuse the training set’s mean and standard deviation to normalize the test set. This is a very subtle source of data leakage that most are apt to miss, but avoiding it is important to creating the best machine learning model possible.
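The split-then-normalize order can be sketched in a few lines of standard-library Python. This is an illustration under simple assumptions (a single numeric feature and a random split), and the helper names are made up for the example. In scikit-learn, the equivalent pattern is to call `fit` on a `StandardScaler` with the training data only and then `transform` both sets.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(values):
    """Compute normalization statistics from the given values."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Normalize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


values = [float(v) for v in range(1, 21)]  # toy feature column

# 1. Split first, so the test set plays no part in fitting.
train, test = train_test_split(values)

# 2. Fit the statistics on the training set only.
mean, std = fit_scaler(train)

# 3. Reuse the same statistics for both sets.
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)
```

The scaled test set will generally not have a mean of exactly zero or a standard deviation of exactly one. That is expected, since it is being measured against the training distribution.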
Leakage During Partitioning
According to Mr. Guts, partitioning is one of the most common sources of data leakage. When splitting data that involves multiple observations from the same person or source, the data must be partitioned such that all observations from a given user are included in one set and only one set. This also applies to validation, where that person’s observations must only appear in one fold during k-fold cross-validation. Failure to do this produces what is referred to as group leakage.
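A group-aware split can be sketched as follows. This is a simplified illustration rather than the method used in the talk: the patient IDs are invented, and the greedy fold assignment is just one reasonable choice. scikit-learn provides `GroupKFold` in `sklearn.model_selection` for the same purpose.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=3):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    if n_splits > len(by_group):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest observations.
    folds = [[] for _ in range(n_splits)]
    for group in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        min(folds, key=len).extend(by_group[group])

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Hypothetical patient ID for each observation (e.g. one per image).
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
splits = list(group_k_fold(patient_ids, n_splits=3))

for train_idx, test_idx in splits:
    train_patients = {patient_ids[i] for i in train_idx}
    test_patients = {patient_ids[i] for i in test_idx}
    print(sorted(test_patients), "held out; overlap:", train_patients & test_patients)
```

Splitting on the group identifier rather than on individual rows is what keeps each patient’s observations on one side of every fold.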
Guts pointed out that even the best can fall into this trap. He pointed to a paper released by Andrew Ng’s research group, which attempted to detect pneumonia from chest X-rays. The data included 112,120 unique images from 30,805 unique patients. It wasn’t until after the paper was published that a practitioner on Twitter asked whether the data had been split by patient, which it had not. This prompted Ng’s group to modify the experiment so that the data leakage was eliminated, resulting in less optimistic estimates of model performance.
This source of leakage is unlikely to dramatically impact model performance. In most cases, it makes a difference of only a few percentage points in accuracy; even so, it’s a methodological flaw that must be addressed in order to get the best possible performance from a model.
These are just a few examples of different sources of data leakage, but Mr. Guts goes into many other subtle ways that it can be introduced to your model. Check out his full talk on YouTube for a deep dive into how leakage can infiltrate seemingly bulletproof applications.