Finding That Needle! Isolation Forests for Anomaly Detection
Posted by ODSC Community on May 3, 2021
One of the best parts of data science is that algorithms developed for one application turn up in other applications they were not originally designed for! This is very true in the world of fraud and anomaly detection. Many algorithms have their foundation elsewhere but find their usefulness in detecting rare events and data points. One great example of this would be isolation forests!
Isolation forests are a tree-based approach to anomaly detection. The basic idea is to slice your data into random pieces and see how quickly certain observations are isolated. You pick a random axis and a random point along that axis to separate your data into two pieces. Then you repeat this process within each of the two pieces, over and over again in each subsequent piece, until there is only one data point left in that subset (or slice) of the data. The resulting structure is called an isolation tree.
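To make the splitting idea concrete, here is a minimal sketch of a single isolation tree in plain NumPy. It only follows the branch that contains the point we care about and counts how many random splits it takes to leave that point alone. The function name isolation_depth and the toy data are made up for illustration; this is not how any particular library implements it.

```python
import numpy as np

def isolation_depth(data, target, rng, depth=0):
    """Count the random splits needed to isolate `target` within `data`."""
    # Stop when the target is alone (or every remaining point is identical).
    if len(data) <= 1 or np.all(data == data[0]):
        return depth
    # Pick a random axis that still has some spread, then a random cut along it.
    spread = np.ptp(data, axis=0)
    axis = rng.choice(np.flatnonzero(spread > 0))
    cut = rng.uniform(data[:, axis].min(), data[:, axis].max())
    # Keep only the side of the cut that contains the target point.
    same_side = (data[:, axis] < cut) == (target[axis] < cut)
    return isolation_depth(data[same_side], target, rng, depth + 1)

# Toy data (an assumption for this sketch): a dense cloud plus one far-away point.
rng = np.random.default_rng(0)
cloud = rng.normal(0.0, 1.0, size=(200, 2))
points = np.vstack([cloud, [[8.0, 8.0]]])
outlier = points[-1]
central = points[np.argmin(np.linalg.norm(cloud, axis=1))]

print("splits to isolate the outlier:", isolation_depth(points, outlier, rng))
print("splits to isolate a central point:", isolation_depth(points, central, rng))
```

A single tree is noisy, so the two counts can be close on an unlucky run, but the far-away point usually needs far fewer splits.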
We don’t want observations to just get lucky though! What if an observation happens to avoid being isolated simply because our splits are made completely at random? So we repeat this same process many times to build an entire forest of trees! Tree-mendous name, I know! From there we measure how quickly we can isolate each point, on average across all the trees, to build a score for each data point. The points that are easiest to isolate are the most likely to be anomalies.
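For readers who want the formula behind that score, the original isolation forest paper (Liu, Ting, and Zhou) normalizes the average path length roughly as follows. Here h(x) is the number of splits needed to isolate point x in one tree, E[h(x)] is its average across the trees, n is the number of points used to build each tree, and H(i) is the harmonic number. Treat this as a sketch of that definition; individual libraries may rescale or flip the sign of the score.

```latex
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}},
\qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n},
\qquad
H(i) \approx \ln(i) + 0.5772
```

Scores close to 1 mean a point is isolated after very few splits, while scores well below 0.5 are typical of points deep inside the data cloud.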
Let’s see a quick example. Imagine we had two variables, Income and Coverage to Income Ratio, in which we wanted to find anomalies. We start by loading the necessary Python packages.
We can plot these two variables in a scatter plot. Here we can see that there is a dense cloud of points with other points along the edge that might be anomalies.
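As a minimal sketch of this setup, the snippet below loads the usual packages, builds a dataframe, and draws the scatter plot. The data here is synthetic and purely an assumption, as are the helper make_demo_data and the column names income and coverage_to_income.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def make_demo_data(n_normal=500, n_odd=15, seed=42):
    """Build a synthetic stand-in dataset: a dense cloud plus a few scattered points."""
    rng = np.random.default_rng(seed)
    # Dense cloud of "typical" observations.
    income = rng.normal(60_000, 12_000, n_normal)
    ratio = rng.normal(3.0, 0.6, n_normal)
    # A handful of points spread over a much wider range.
    odd_income = rng.uniform(20_000, 150_000, n_odd)
    odd_ratio = rng.uniform(0.5, 8.0, n_odd)
    return pd.DataFrame({
        "income": np.concatenate([income, odd_income]),
        "coverage_to_income": np.concatenate([ratio, odd_ratio]),
    })

df = make_demo_data()

# Scatter plot of the two variables.
fig, ax = plt.subplots()
ax.scatter(df["income"], df["coverage_to_income"], s=10, alpha=0.6)
ax.set_xlabel("Income")
ax.set_ylabel("Coverage to Income Ratio")
plt.show()
```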
We can easily run an isolation forest in Python on a dataframe built from the two variables.
The IsolationForest class is all we need, along with its fit method applied to the dataframe, here called df. The n_estimators option defines how many trees we want in our forest, which is 500 for this example. From there we get an isolation score using the score_samples method of the IsolationForest object. Note that in scikit-learn, score_samples returns lower values for points that are easier to isolate, so the scores need to be flipped for a plot where bigger means more anomalous. I have graphed the same plot as above, but this time with the circles sized by their isolation forest score.
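Here is a minimal sketch of that step using scikit-learn. The dataframe is again synthetic (an assumption, with two obviously odd points appended at the end), and the anomaly_score column and the marker-size scaling are just one reasonable way to draw the plot, not necessarily the one used for the original figure.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: a dense cloud plus two far-away points at the end.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": np.append(rng.normal(60_000, 12_000, 300), [150_000, 15_000]),
    "coverage_to_income": np.append(rng.normal(3.0, 0.6, 300), [8.0, 0.3]),
})
features = ["income", "coverage_to_income"]

# Build a forest of 500 isolation trees and fit it to the two variables.
forest = IsolationForest(n_estimators=500, random_state=0)
forest.fit(df[features])

# score_samples is lower for points that are easier to isolate,
# so negate it to get a score where larger means more anomalous.
df["anomaly_score"] = -forest.score_samples(df[features])

# Rescale the scores to marker sizes so the easiest-to-isolate points are largest.
score = df["anomaly_score"]
sizes = 5 + 200 * (score - score.min()) / (score.max() - score.min())

fig, ax = plt.subplots()
ax.scatter(df["income"], df["coverage_to_income"], s=sizes, alpha=0.5)
ax.set_xlabel("Income")
ax.set_ylabel("Coverage to Income Ratio")
plt.show()
```

Fixing random_state simply makes the sketch reproducible; drop it if you want a different forest on every run.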
The larger the circles, the more likely they are anomalies because they are easier to isolate. Not surprisingly, we can see the observations on the outside of the main data cloud are large, while the points right in the middle of the data cloud are rather small.
There are so many more data science approaches to anomaly detection. In fact, I will talk more about isolation forests as well as two other techniques – local outlier factor and classifier-adjusted density estimation (CADE) – at my talk “Finding the Needle! Modern Approaches to Fraud and Anomaly Detection” at ODSC Europe 2021. Please join me there!
A Teaching Associate Professor in the Institute for Advanced Analytics, Dr. Aric LaBarr is passionate about helping people solve challenges using their data. There he helps design the innovative program to prepare a modern workforce to wisely communicate and handle a data-driven future at the nation’s first Master of Science in Analytics degree program. He teaches courses in predictive modeling, forecasting, simulation, financial analytics, and risk management.