Anomalies. The Oxford dictionary defines them as things that deviate from what is normal or expected. No matter what field you are in, they seem to pop up without warning. In the realm of data, anomalies can lead to incorrect or outdated decisions. This means we need to find them before they become too much of a problem! Whether you are cleaning your data for analysis, monitoring the health of your computer systems, watching for cybersecurity threats, or sifting through claims and transactions looking for fraud, anomalies can drastically distort any analysis. This is why we need data science, through anomaly detection, to help us identify and flag anomalous data points before their impact is felt too strongly.
A lot of anomaly detection systems are based on business rules. For example, a transaction is above a certain threshold (say $250) for a certain type of purchase (say online transactions). If a transaction is above this threshold, then an alert is sent to either the purchaser or an evaluator as a possibly suspicious transaction. In cybersecurity, it might be that the system is pinged an abnormal number of times during hours when customers are not expected to be on the website. Business rules are simple and easy to implement, but they are difficult to maintain and to keep ahead of bad actors who can learn and exploit them. I remember working on a project for a major bank where we discovered that an individual had realized the bank would only charge a fee if he overdrafted by $10 or more. Consistently, the individual would overdraft by nine dollars and change. He had learned the system and took advantage of it.
Data science can help evaluators and systems catch and flag anomalous activity with more precision and more efficiency in both operations and costs. This means that individuals investigating these data points are using their time more wisely and efficiently, leading to higher successes in stopping the impact of anomalies as well as customers feeling less impact themselves.
In my experience, there are two main groups of approaches for anomaly detection with data science – probability/statistical-based approaches and machine learning approaches. Both have their benefits!
Some common probability / statistical-based approaches are:
- Benford’s Law
- Z-scores and their robust version
- IQR rule and its adjustment
- Mahalanobis distances and their robust version
Some common machine learning approaches are:
- k-Nearest Neighbors (k-NN)
- Local Outlier Factor (LOF)
- Isolation forests
- Classifier-Adjusted Density Estimation (CADE) – one of my personal favorites!
- One-class Support Vector Machines (SVM’s)
Here is a brief summary of some of these!
Let’s start with some statistical-based approaches. Z-scores are a rather normal (pun intended) calculation for anomalies. They are derived from the normal distribution, where we say that observations more than three standard deviations from the mean are anomalies or outliers, since around 99.7% of all data in a normal distribution falls within three standard deviations of the average (or mean). This is easily calculated with the equation:

z = (x − x̄) / s
The downside of this calculation is that the mean (x̄ above) and standard deviation (s above) are both sensitive to anomalies in the data! Therefore, the calculation is impacted by the very anomalies it is trying to detect. It is made more robust by swapping the mean for the median and the standard deviation for the median absolute deviation (MAD). The Mahalanobis distance (and its robust version) is just a multi-dimensional generalization of this technique to handle a cloud of data points (a scatterplot, for example) instead of just a one-dimensional view.
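To see the difference between the two, here is a minimal Python sketch using NumPy. The sample data, the three-standard-deviation cutoff, and the 0.6745 scaling constant (which makes the MAD comparable to a standard deviation under normality) are common conventions I've chosen for illustration, not anything specific from this post:

```python
import numpy as np

def z_scores(x):
    # Classic z-score: distance from the mean in standard deviations.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def robust_z_scores(x):
    # Robust version: median and median absolute deviation (MAD).
    # The 0.6745 factor rescales the MAD so it is comparable to a
    # standard deviation when the data are roughly normal.
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad

data = [10, 11, 9, 10, 12, 10, 11, 100]  # 100 is a planted anomaly

# The anomaly inflates the mean and standard deviation so much that
# its classic z-score never reaches 3 -- it "masks" itself.
print(np.abs(z_scores(data)) > 3)

# The robust version flags it (and only it).
print(np.abs(robust_z_scores(data)) > 3)
```

This small example also shows *why* the robust version matters: in a sample this size, a single extreme value drags the mean and standard deviation toward itself enough that the classic rule misses it entirely.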
The IQR rule was created in a similar vein. IQR stands for interquartile range. Any observation within the bounds of 1.5*IQR above the third quartile (the 75th percentile of your data) and 1.5*IQR below the first quartile (the 25th percentile of your data) is considered a “normal” observation. Anything outside these bounds is flagged as an anomaly. This works well for a symmetric distribution, but not so well for skewed ones. Luckily, there is an adjustment for this as well, based on the medcouple. Medcouples aren’t a married pair of doctors, but a fancy name for a scaled median difference between the left and right sides of a skewed distribution.
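The basic (unadjusted) IQR rule takes only a few lines of Python; the sample data here is an illustrative choice, and 1.5 is the usual textbook multiplier:

```python
import numpy as np

def iqr_outliers(x):
    # Flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lower) | (x > upper)

data = [10, 11, 9, 10, 12, 10, 11, 100]  # 100 is a planted anomaly
print(iqr_outliers(data))  # only the planted point is flagged
```

For the skew-adjusted version, the symmetric 1.5*IQR fences get replaced with medcouple-based ones; if you want to experiment, statsmodels is one library that provides a medcouple statistic.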
Now let’s look at some machine learning approaches! The k-nearest neighbors and local outlier factor algorithms are rather close in design. K-nearest neighbors is exactly what the name implies. It looks at every data point in a multi-dimensional space (again, think scatterplot) and measures their distance (typically Euclidean distance) to the k-nearest data points. For anomaly detection, we take the average of these distances to find data points that are “far away” from other data points on average. This approach is really good at finding what are called “global” anomalies or outliers, but not necessarily local ones. That is where the local outlier factor (LOF) comes into play. The LOF looks at the same distances that the k-NN approach looks at, but specifically focuses on the largest of these distances instead of the average. It essentially compares this largest distance (more specifically the area of a circle with this largest distance) to the average of the same measure for each of the k-nearest neighbors. The closer the data point is to the nearest neighbors, the closer these values. However, if a data point doesn’t look like its neighbors, it will be flagged as a local outlier.
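A sketch of the local-outlier idea above, assuming scikit-learn is available (the cluster shapes, seed, and n_neighbors value are illustrative choices of mine, not from this post):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# A tight cluster and a loose cluster, plus one point that sits near
# the tight cluster but clearly outside it -- a "local" outlier that a
# purely global distance check could miss.
tight = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
loose = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
local_outlier = np.array([[0.0, 1.0]])
X = np.vstack([tight, loose, local_outlier])

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)  # -1 = flagged as an outlier, 1 = inlier
print(labels[-1])            # the planted point doesn't look like its neighbors
```

The planted point would look unremarkable next to the loose cluster, but its ten nearest neighbors all live in the tight cluster, so its local density is far lower than theirs and LOF flags it.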
Isolation forests are a tree-based approach to anomaly detection. The basic idea is to slice your data into random pieces and see how quickly certain observations are isolated. You pick a random axis and a random point along that axis to separate your data into two pieces. Then you repeat this process within each of the two pieces over and over again. This process builds an isolation tree. We don’t want observations to just get lucky though! So we repeat this process many times to build an entire forest of trees! I know, nerds have fun with naming things. You have to bear with us. We take a measure of how quickly each point can be isolated on average across all the trees to build a score for each data point. The points that are easiest to isolate are the most likely to be anomalies.
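Assuming scikit-learn is available, its IsolationForest gives a quick sketch of this idea (the data, seed, and forest size are illustrative choices of mine):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_points = rng.normal(size=(200, 2))     # the "expected" data
X = np.vstack([normal_points, [[8.0, 8.0]]])  # one easy-to-isolate point

iso = IsolationForest(n_estimators=200, random_state=0)
labels = iso.fit_predict(X)    # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)  # lower score = easier to isolate

print(labels[-1])                        # the planted point is flagged
print(np.argmin(scores) == len(X) - 1)   # and gets the most extreme score
```

Because the planted point sits far from everything else, a random split is very likely to separate it early, giving it a short average path length across the forest and therefore the most anomalous score.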
There are so many more data science approaches to anomaly detection. I didn’t even cover all the ones listed above in this short blog post! If you would like to know more about these concepts as well as how to build out each of these concepts in Python and R, please attend my live training on an Introduction to Anomaly Detection and Fraud or take the training anytime on-demand. Looking forward to helping you find your anomalies!
About the author: A Teaching Associate Professor in the Institute for Advanced Analytics, Dr. Aric LaBarr is passionate about helping people solve challenges with their data. At the nation’s first Master of Science in Analytics degree program, he helps design an innovative curriculum that prepares a modern workforce to wisely communicate and handle a data-driven future. He teaches courses in predictive modeling, forecasting, simulation, financial analytics, and risk management. Previously, he was Director and Senior Scientist at Elder Research, where he mentored and led a team of data scientists and software engineers. As director of the Raleigh, NC office, he worked closely with clients and partners to solve problems in banking, consumer packaged goods, healthcare, and government. Dr. LaBarr holds a B.S. in economics, as well as a B.S., M.S., and Ph.D. in statistics — all from NC State University.