A real-world dataset often contains anomalies or outlier data points, which may result from data corruption, experimental error, or human error. Because anomalies can degrade a model's performance, the dataset should ideally be free of them before training a robust data science model.
In this article, we will discuss 5 such anomaly detection techniques and compare their performance for a random sample of data.
What are Anomalies?
Anomalies are data points that stand out from the other data points in the dataset and do not conform to the data's normal behavior. These observations deviate from the dataset's normal behavioral patterns.
Anomaly detection is an unsupervised data processing technique for detecting anomalies in a dataset. Anomalies can be broadly classified into different categories:
- Outliers: Short/small anomalous patterns that appear in a non-systematic way in data collection.
- Change in Events: Systematic or sudden change from the previous normal behavior.
- Drifts: Slow, unidirectional, long-term change in the data.
Anomaly detection is very useful for detecting fraudulent transactions or diseases, or for handling any case study with high class imbalance. Anomaly detection techniques can also be used to build more robust data science models.
How to Detect Anomalies?
Simple statistical techniques such as the mean, median, and quantiles can be used to detect anomalies in univariate feature values. Various data visualization and exploratory data analysis techniques can also be used to detect anomalies.
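As a small illustration of the quantile idea, the interquartile-range (IQR) rule flags points far outside the middle 50% of the data. The sample values below are made up for illustration:

```python
import numpy as np

# Hypothetical univariate sample with two injected extreme values
data = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 25.0, -5.0])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # the two injected extremes
```

The 1.5 multiplier is the conventional choice (the same rule used for box-plot whiskers); it can be tightened or loosened depending on how aggressive the detection should be.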
In this article, we will discuss some unsupervised machine learning algorithms to detect anomalies, and further compare their performance for a random sample dataset.
1. Isolation Forest
2. Local Outlier Factor
3. Robust Covariance
4. One-Class SVM
5. One-Class SVM (SGD)
Isolation Forest:
Isolation Forest is an unsupervised anomaly detection algorithm that builds an ensemble of random decision trees (similar to a random forest) under the hood to detect outliers in the dataset. The algorithm recursively splits the data points until each observation is isolated from the others.
Usually, anomalies lie away from the cluster of data points, so it is easier to isolate an anomaly compared to a regular data point.
(Image by Author), Partitioning of Anomaly and Regular data point
From the above image, it can be observed that regular data points require a comparatively larger number of partitions to isolate than an anomalous data point.
An anomaly score is computed for every data point, and points whose anomaly score exceeds a chosen threshold can be considered anomalies.
Scikit-learn implementation of Isolation Forest algorithm
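A minimal sketch of the scikit-learn IsolationForest API; the toy data and the contamination value (the assumed fraction of anomalies) are illustrative assumptions, not from the original gist:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Hypothetical 2-D data: a dense cluster plus a few far-away points
X_regular = 0.3 * rng.randn(100, 2)
X_anomalies = rng.uniform(low=-4, high=4, size=(5, 2))
X = np.vstack([X_regular, X_anomalies])

# contamination sets the expected proportion of anomalies,
# which internally determines the anomaly-score threshold
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = clf.fit_predict(X)       # +1 = regular point, -1 = anomaly
scores = clf.decision_function(X) # lower score = more anomalous
```

Points with a decision-function value below the fitted threshold are the ones labelled -1, matching the "anomaly score > threshold" rule described above.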
Local Outlier Factor:
Local Outlier Factor is another anomaly detection technique that takes the density of data points into consideration to decide whether a point is an anomaly. The method computes an anomaly score, called the local outlier factor, that measures how isolated a point is with respect to its surrounding neighborhood. It takes both the local and the global density into account when computing this score.
(Source), Local Outlier Factor Formulation
Scikit-learn implementation of Local Outlier Factor
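A hedged sketch of the scikit-learn LocalOutlierFactor API on a made-up sample; the neighborhood size and contamination value are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Hypothetical dense cluster plus three clearly isolated points
X_regular = 0.3 * rng.randn(100, 2)
X_anomalies = np.array([[3.0, 3.0], [-3.0, 2.5], [2.5, -3.0]])
X = np.vstack([X_regular, X_anomalies])

# n_neighbors controls the size of the local neighborhood
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
labels = lof.fit_predict(X)  # +1 = regular, -1 = anomaly
# negative_outlier_factor_: more negative means more anomalous
scores = lof.negative_outlier_factor_
```

Note that LocalOutlierFactor is fit-and-predict in one step by default; prediction on unseen data requires constructing it with novelty=True.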
Robust Covariance:
For gaussian, independent features, simple statistical techniques can be employed to detect anomalies in the dataset. For a gaussian/normal distribution, data points lying more than 3 standard deviations from the mean can be considered anomalies.
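The 3-standard-deviation rule can be sketched in a few lines; the synthetic normal sample and injected extremes below are illustrative:

```python
import numpy as np

rng = np.random.RandomState(1)
# Hypothetical gaussian feature, plus two injected extreme values
x = rng.normal(loc=50, scale=5, size=1000)
x = np.append(x, [90.0, 5.0])

# Flag points lying more than 3 standard deviations from the mean
mu, sigma = x.mean(), x.std()
anomalies = x[np.abs(x - mu) > 3 * sigma]
```

For a true gaussian, roughly 0.3% of regular points also fall outside this band, so a handful of false positives is expected on large samples.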
When all the features in a dataset are gaussian in nature, this statistical approach can be generalized by fitting an elliptic envelope (an ellipsoid) that covers most of the regular data points; data points lying outside the envelope can be considered anomalies.
Scikit-learn implementation of Robust Covariance using Elliptic Envelope
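A minimal sketch of scikit-learn's EllipticEnvelope on a made-up correlated gaussian sample; the contamination value and data are illustrative assumptions:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
# Hypothetical correlated gaussian cluster plus two distant points
X_regular = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=200)
X_anomalies = np.array([[6.0, -6.0], [-5.0, 5.0]])
X = np.vstack([X_regular, X_anomalies])

# Fits a robust covariance estimate and flags points with
# large Mahalanobis distance from the fitted ellipse
cov = EllipticEnvelope(contamination=0.02, random_state=0)
labels = cov.fit_predict(X)  # +1 = regular, -1 = anomaly
```

Because the estimator fits a single ellipsoid, it works best when the regular data is roughly unimodal and gaussian, as noted above.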
One-Class SVM:
A regular SVM algorithm tries to find a hyperplane that best separates two classes of data points. In one-class SVM, there is only one class of data points, and the task is to learn a boundary (a hypersphere in feature space) that separates the cluster of regular data points from the anomalies.
Scikit-learn implementation of One-Class SVM
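A hedged sketch of the scikit-learn OneClassSVM API; the training sample, nu value, and test points are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Train only on regular points; anomalies are unseen at training time
X_train = 0.3 * rng.randn(200, 2)
# One point near the cluster, one far away
X_test = np.array([[0.0, 0.1], [3.0, 3.0]])

# nu upper-bounds the fraction of training errors (assumed outliers)
clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
clf.fit(X_train)
labels = clf.predict(X_test)  # +1 = inside the boundary, -1 = anomaly
```

Unlike Isolation Forest or LOF, this is a fit-then-predict model, so it naturally scores new observations after training on clean data.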
One-Class SVM (SGD):
One-Class SVM with SGD solves the linear One-Class SVM using Stochastic Gradient Descent. The implementation is meant to be combined with a kernel approximation technique to obtain results similar to sklearn.svm.OneClassSVM, which uses a Gaussian kernel by default.
Scikit-learn implementation of One-Class SVM with SGD
The 5 anomaly detectors are trained on two sets of sample datasets (row 1 and row 2).
(Image by Author), Performance of 5 anomaly detection algorithms with a toy dataset
One-class SVM tends to overfit a bit, whereas the other algorithms perform well with the sample dataset.
Anomaly detection algorithms are very useful for fraud detection or disease detection case studies, where the distribution of the target class is highly imbalanced. They can also be used to further improve the performance of a model by removing anomalies from the training sample.
Apart from the above-discussed machine learning algorithms, the data scientist can always employ advanced statistical techniques to handle the anomalies.
Scikit-learn documentation: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html
Article originally posted here by Satyam Kumar. Reposted with permission.