Outliers in Data Science: To Be or Not to Be an Anomaly?
Posted by ODSC Community, August 10, 2020
An outlier may be defined as an object that is out of the ordinary, one that differs significantly from the norm. In day-to-day terms, it could be a baby panda among adult pandas, a champion breaking a world record, or a fraudulent email in your inbox.
Why even bother to detect outliers?
- To react. When unusual behavior appears, especially negative behavior, a quick reaction is key. The sooner a fraudulent email is detected, the sooner it can be removed before it endangers the user. Detecting a machine’s fault in time may even save lives.
- To know ‘normality’. Including outliers when drawing inferences may lead to incorrect conclusions. If a student failed one test while nailing all the rest, the ‘normal’ behavior should still be the basis for judgment (even if reacting, as in the first point, may be a good idea).
- To predict accurately. Outliers left in the training data can distort a model’s fit and, with it, the forecasts.
How to detect anomalies?
From the modeling point of view, anomalies can be found in a lot of ways!
- Intentional supervised approach. The most costly option, with a high barrier to entry and no immunity to pattern changes, yet quite effective in a stable environment. It requires manually labeling your data points as outliers or as typical observations. Once you have the labels, good old classification methods may be applied. To speed up the process, a visual labeling tool with a model that retrains automatically behind the scenes is pretty useful. However, tool or no tool, labeling usually has a flavor of tediousness to it.
- A side effect of a supervised approach. Let’s forget about outliers and just model the variable at hand in the best possible way, preferably with exogenous variables. Then, using the prediction errors, identify the observations with the highest discrepancies. If a model that captures the pattern has trouble fitting them, there is a pretty high chance they are not typical observations. Also, some methods, like X13, have outlier detection built in.
- Unsupervised methods. They carry more uncertainty than the alternatives, but you can use them right away: data and business knowledge are really all you need to start.
- Mixed approach. Anomaly detection is like playing detective: you arrive at a suspect, but human feedback can still strengthen the ‘evidence’. That’s why, for example, when you sign in to your email from another device, you are asked whether it’s really you. An anomaly was detected, but for the model to improve, a confirmation is needed. The mixed approach is my personal favorite.
Walking through an example: detection
Let’s focus on univariate outliers, using representatives of the unsupervised methods and of the side-effect-of-supervised approach. For demonstration purposes, I’ll be using my own Fitbit data on floors climbed per day since the beginning of 2018:
The obvious outliers can be caught with the naked eye: these peaks are mountain hiking days 😉 Let’s start outlier detection with a basic IQR-based algorithm:
The blue line shows the rolling (over 60 days) IQR-based upper cutoff value. The rolling approach helps us understand what the outlier detection would have looked like had we applied it in the past, assuming a window of about two months is representative. As you can see, it’s actually not bad! However, the lower IQR cutoff lies completely outside the scale of the plot: it’s below zero, so it can never flag a day in data that cannot go negative, and the approach doesn’t seem correct here. When zeros appear, especially several in a row, it’s quite an unusual situation, usually health-related, and it should be picked up by the algorithm.
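A minimal sketch of such a rolling IQR cutoff is shown here, under stated assumptions. The 60-day window comes from the text; the 1.5 multiplier (Tukey's usual convention), the one-day shift so that each cutoff uses only earlier days, the function name `rolling_iqr_cutoffs`, and the synthetic Poisson data standing in for the Fitbit series are all illustrative choices, not the author's actual code.

```python
import numpy as np
import pandas as pd


def rolling_iqr_cutoffs(series: pd.Series, window: int = 60, k: float = 1.5) -> pd.DataFrame:
    """Rolling IQR-based cutoffs built from the previous `window` observations."""
    # shift(1) keeps today's value out of its own cutoff (assumption: this
    # mimics applying the rule "in the past", one day at a time).
    past = series.shift(1).rolling(window, min_periods=window)
    q1 = past.quantile(0.25)
    q3 = past.quantile(0.75)
    iqr = q3 - q1
    out = pd.DataFrame(
        {"value": series, "lower": q1 - k * iqr, "upper": q3 + k * iqr}
    )
    # Comparisons against NaN cutoffs (the warm-up period) evaluate to False.
    out["is_outlier"] = (out["value"] > out["upper"]) | (out["value"] < out["lower"])
    return out


# Synthetic stand-in for daily floor counts (hypothetical data).
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", periods=200, freq="D")
floors = pd.Series(rng.poisson(10, size=200).astype(float), index=dates, name="floors")
floors.iloc[150] = 120  # a hypothetical hiking day

result = rolling_iqr_cutoffs(floors)
print(result[result["is_outlier"]])
```

Because the cutoffs need a full window of history, no day in the first 60 can be flagged in this sketch.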
Another unsupervised method, Isolation Forest, deals with outliers slightly better, as it discovers both high and low extremes:
However, adjusting the sample score cutoff (green) may be useful. With the initial cutoff, the algorithm flagged quite a few outliers (about 12% of all observations).
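The cutoff adjustment can be sketched with scikit-learn's `IsolationForest`, assuming that library is what one would reach for (the article does not name its implementation). The synthetic data, the 2% fraction, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for daily floor counts (hypothetical data).
rng = np.random.default_rng(0)
floors = rng.poisson(10, size=300).astype(float)
floors[[50, 120]] = [110, 95]  # hypothetical hiking days
floors[200:203] = 0            # hypothetical run of zero days

X = floors.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
model = IsolationForest(n_estimators=200, contamination="auto", random_state=0).fit(X)

# score_samples: the lower the score, the more anomalous the observation.
scores = model.score_samples(X)

# Flags produced by the model's built-in cutoff.
default_flags = model.predict(X) == -1

# Adjusted cutoff: flag only the lowest-scoring 2% (an assumed fraction).
cutoff = np.quantile(scores, 0.02)
strict_flags = scores <= cutoff

print("flagged by default cutoff:", int(default_flags.sum()))
print("flagged by adjusted cutoff:", int(strict_flags.sum()))
```

Working from the raw scores rather than the default labels keeps the choice of cutoff explicit, so it can be tuned against what the business considers unusual.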
Moving on to the side effect of a supervised method, the result looks like this:
The algorithm identifies even more zero outliers than the Isolation Forest, but fewer non-zero ones.
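The article does not say which supervised model was used, so the following is only a sketch of the general idea: fit a model, then flag the days with the largest prediction errors. The stand-in model (one mean level per day of week, fitted by least squares), the robust z-score with the conventional 0.6745 constant and 3.5 threshold, the function name `residual_outliers`, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd


def residual_outliers(series: pd.Series, threshold: float = 3.5) -> pd.DataFrame:
    """Flag days whose prediction error is unusually large (robust z-score)."""
    # Stand-in model: one mean level per day of week, fitted by least squares.
    weekday = pd.Series(series.index.dayofweek, index=series.index)
    X = pd.get_dummies(weekday).to_numpy(dtype=float)
    y = series.to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted

    # Scale residuals by the median absolute deviation (MAD), so that the
    # outliers themselves do not inflate the yardstick.
    centered = resid - np.median(resid)
    mad = np.median(np.abs(centered))
    if mad == 0:
        raise ValueError("MAD is zero; residuals are too concentrated to score")
    robust_z = 0.6745 * centered / mad

    return pd.DataFrame(
        {
            "value": y,
            "fitted": fitted,
            "robust_z": robust_z,
            "is_outlier": np.abs(robust_z) > threshold,
        },
        index=series.index,
    )


# Synthetic stand-in for daily floor counts (hypothetical data).
rng = np.random.default_rng(1)
dates = pd.date_range("2018-01-01", periods=210, freq="D")
floors = pd.Series(rng.poisson(10, size=210).astype(float), index=dates)
floors.iloc[150] = 120  # a hypothetical hiking day

flagged = residual_outliers(floors)
print(flagged[flagged["is_outlier"]])
```

A richer model with real exogenous variables would slot in where the day-of-week fit is; the residual scoring step stays the same.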
Walking through an example: inference
An immediate output of the anomaly detection techniques presented above was the identification of hiking days and lazy days. But what about inference? What could be the actual business case?
Let the purpose be to determine whether the climbing patterns improve or worsen over time. The impact of long holidays is to be excluded, as it’s just temporary behavior and not a change in healthy lifestyle patterns. Sickness is to be removed so as not to mix up laziness with inactivity caused by health problems. Those two labels may be added semi-manually:
- Finding candidates for outliers.
- Identifying sequences of outliers (also known as temporary changes).
- Manually labeling the temporary changes.
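The middle step, turning individual outlier flags into sequences, can be sketched as follows. The function name `outlier_runs`, the minimum run length of three days, and the toy flags are hypothetical; the flags would normally come from whichever detector produced the candidates.

```python
import pandas as pd


def outlier_runs(flags: pd.Series, min_length: int = 3) -> pd.DataFrame:
    """Collapse consecutive flagged days into runs (candidate temporary changes)."""
    flags = flags.astype(bool)
    # A new run id starts whenever the flag value changes from the previous day.
    run_id = (flags != flags.shift()).cumsum()
    rows = []
    for _, run in flags.groupby(run_id):
        # Keep only runs of flagged days that are long enough.
        if run.iloc[0] and len(run) >= min_length:
            rows.append(
                {"start": run.index[0], "end": run.index[-1], "length": len(run)}
            )
    return pd.DataFrame(rows, columns=["start", "end", "length"])


# Toy flags: two isolated outliers and one four-day run (hypothetical data).
dates = pd.date_range("2019-07-01", periods=12, freq="D")
flags = pd.Series([0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0], index=dates).astype(bool)

runs = outlier_runs(flags)
runs["label"] = ""  # filled in by hand afterwards, e.g. "holiday" or "sickness"
print(runs)
```

Isolated one-day outliers are left out of the table on purpose, so that only sequences are put forward for manual labeling.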
After removing holidays and sickness in this way, the data for inference are ready:
Please note how beautifully misleading conclusions may be when looking at the wrong data subset! When considering all the data, it seems the activity was admirable and COVID-19 spoiled the trend, whereas in reality intensive holidays skewed the pattern and the drop wasn’t that extreme. When excluding all outliers, and not only temporary changes, the 2020 activity seems unpromising, because one-day intensive exercise is removed too. In reality, the activity was decent in the past, dropped in 2019, and is now slowly rising again. It shows how crucial it is to always keep the business purpose in mind when handling outliers.
I encourage you to put your own data under a microscope too!
Is that all?
By now we feel pretty confident in the world of one-dimensional outliers. Now let’s imagine a dataset in which each day is represented by a time series instead of a single number. Anomaly detection then becomes significantly more complicated, as not only the number of floors differs, but also the dynamics of climbing them during the day. I’ll be talking more about this quite interesting challenge at this year’s ODSC Europe Conference. In my talk, “Multivariate (Flight) Anomalies Detection,” you will learn how to detect anomalies in multidimensional space and, before that, how to distinguish data quality issues from anomalies. This time the context will be the aviation industry, so flight profiles will be ‘taken under the data science microscope’. However, the proposed modeling approach is transferable to any other domain. I hope to see you there!
About the author/ODSC Europe speaker: Marta Markiewicz
Marta is Head of Data Science at Objectivity, with a background in Mathematical Statistics. For about 9 years, she has been discovering the potential of data in various business domains, including medical data, retail, HR, finance, aviation, real estate, and more. She deeply believes in the power of data in every area of life. She is an article writer, a conference speaker, and, privately, a passionate dancer.