Detecting Outliers Detecting Outliers
In this context, outliers are data observations that are distant from other observations. There are a number of reasons why variability may exist in the data... Detecting Outliers

In this context, outliers are data observations that are distant from other observations. There are a number of reasons why variability may exist in the data that you are working on during your analysis. Outliers may cause serious problems in your efforts as a Data Scientist.

title author date
Detecting Outliers
Damian Mingle
06/10/2018

Preliminaries

# Load libraries
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

Create Data

# Simulate data
simulated_data, _ = make_blobs(n_samples = 255,
                  n_features = 3,
                  centers = 1,
                  random_state = 1)

# Make extreme values
simulated_data[0,0] = 99999
simulated_data[0,1] = 99999

Detect Outliers

Using EllipticEnvelope forces you to specify a contaimination parameter (the proportition of outliers you think are in the data) – a significant limitation to this approach.

# Build outlier detector
outlier_detector = EllipticEnvelope(contamination=.1)

# Fit outlier detector
outlier_detector.fit(simulated_data)

# Predict outliers
outlier_detector.predict(simulated_data)
array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1, -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1, -1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1, -1,  1])

If you are interested in learning about the manual method of detecting outliers look at this video:


 

Original Source

Damian Mingle

Damian Mingle

Damian Mingle is an American businessman, investor, and data scientist. He is the Founder and Chief Data Scientist of LoveToThink.org, a way for skilled professionals to contribute their expertise and empower the world’s social changemakers. Formerly, Damian was the Chief Data Scientist at Intermedix (an R1 company) where he was responsible for leading a team of international data scientists to drive business value. As a leading authority on data science, Damian speaks nationally and internationally on patient safety, global health, and applied data science.