4 Techniques To Deal With Missing Data in Datasets

Missing data is a problem for every data scientist, as it can prevent us from carrying out the analysis we want or from running a certain model. In this article, I will discuss simple methods for dealing with missing values. To preface, however, there is no ‘official’ best way to deal with null data. Typically, the best approach is to understand where the data comes from and what it means, which is referred to as domain knowledge. Nevertheless, let’s begin.

In this article, we will be using the famous and amazing Titanic dataset. I am sure you have all heard of it. The dataset is as follows:

import pandas as pd
data = pd.read_csv('test.csv')
data.info()

Image showing eleven columns: passengerid, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked. Also shows the non-null count and dtype of each column.

Image by author.

data.isnull().sum()

Image by author.

As we can see, the missing data is only in the ‘Age’ and ‘Cabin’ columns. These are float and categorical data types respectively, so we have to handle the two columns differently.
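Before choosing a technique, it can help to see the share of missing values per column and not just the raw counts. The snippet below is a minimal sketch of that idea; the `toy` frame and its values are made up for illustration and only stand in for the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 8.05, 53.1, 13.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
})

# Number of nulls and the percentage of rows they represent, per column.
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})

# Keep only the columns that actually contain missing entries.
summary = summary[summary["n_missing"] > 0]
print(summary)
```

A column that is mostly empty usually calls for a different treatment than one with only a handful of gaps, which is the distinction the techniques below rely on.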

1. Delete the Data

The easiest method is simply to delete every training example (row) in which one or more columns have null entries.

data = data.dropna()
data.isnull().sum()

Image by author.

There are now no null entries! However, there is no free lunch. Take a look at how many training examples are left:

Image by author.

There are only 87 examples left! Originally there were 418, so we have reduced our dataset by around 80%. This is far from ideal, but for other datasets this approach could be very reasonable. As a rule of thumb, I would say a reduction of up to 5% is fine; beyond that, you may lose valuable data that will affect the training of your model.
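One way to apply this rule of thumb is to measure how many rows would be lost before actually dropping them. The following is a small sketch on a made-up `toy` frame (not the Titanic data); it also shows the `subset` argument of `dropna`, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0, 29.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan, "E46"],
})

# Fraction of rows that dropna() would remove when every column is checked.
lost_all = 1 - len(toy.dropna()) / len(toy)

# Checking only 'Age' keeps rows whose only gap is in another column.
age_only = toy.dropna(subset=["Age"])
lost_age = 1 - len(age_only) / len(toy)

print(f"lost when checking all columns: {lost_all:.0%}")
print(f"lost when checking 'Age' only:  {lost_age:.0%}")
```

If the loss is well above the threshold you are comfortable with, one of the imputation techniques below is probably a better fit than deletion.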

2. Imputing Averages

The next method is to assign some average value (mean, median, or mode) to the null entries. Let’s take a look at the following snippet from the data:

data[100:110]
Image showing rows 100 to 109 of the data, with the columns listed in the previous image.

Image from author.

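For the ‘Age’ column, the null entries can be filled with the mean as follows:

# numeric_only=True skips object columns such as 'Cabin'; without it, recent pandas versions raise an error
data.fillna(data.mean(numeric_only=True), inplace=True)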

Image from author.

The average age of 30 has now been added to the null entries. Notice that for the ‘Cabin’ column the entries are still NaN, because you can’t calculate the mean of an object (categorical) column. This can be fixed by filling the column with its mode instead:

# fill only the 'Cabin' column with its most frequent value
data['Cabin'] = data['Cabin'].fillna(data['Cabin'].value_counts().index[0])

Image by author.
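The paragraph above mentions mean, median, and mode as options. As a minimal sketch of the other two, the snippet below fills a numeric column with its median and a categorical column with its mode, one column at a time. The `toy` frame is hypothetical and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 70.0],
    "Cabin": ["B45", np.nan, "B45", "C85"],
})

filled = toy.copy()

# Numeric column: the median is less sensitive to outliers than the mean.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Categorical column: mode() returns a Series, so take its first entry.
filled["Cabin"] = filled["Cabin"].fillna(filled["Cabin"].mode()[0])

print(filled)
```

Filling each column separately avoids accidentally writing one column's statistic into another column's gaps.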

3. Assign New Category

With regard to the ‘Cabin’ feature, it only has 91 non-null entries, which is about 22% of the 418 examples. Therefore, the mode value that we previously calculated is not very reliable. A better way is to assign these NaN values their own category:

data['Cabin'] = data['Cabin'].fillna('Unknown')

Image by author.

As we no longer have any NaN values, machine learning algorithms can now use this dataset. However, a model will treat ‘Unknown’ as a genuine category in the ‘Cabin’ column, even though no such cabin ever existed on the Titanic.
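To see what "its own category" means in practice, here is a small sketch that one-hot encodes a made-up ‘Cabin’ column after the fill. The `toy` frame is hypothetical, and `pd.get_dummies` is only one of several possible encodings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (values are made up for illustration).
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# Give the missing entries their own label.
toy["Cabin"] = toy["Cabin"].fillna("Unknown")

# One-hot encode: 'Unknown' becomes a feature column like any real cabin.
encoded = pd.get_dummies(toy["Cabin"], prefix="Cabin")
print(encoded.columns.tolist())
```

The resulting `Cabin_Unknown` column effectively acts as a "was missing" flag, which a model can use if missingness itself carries information.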

4. Certain Algorithms

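The final technique is to do nothing. The majority of machine learning algorithms do not work with missing data. On the other hand, algorithms such as K-Nearest Neighbor, Naive Bayes, and XGBoost can work with missing data, although this depends on the implementation: XGBoost handles NaN values natively, whereas many K-Nearest Neighbor and Naive Bayes implementations still expect complete inputs. There is much literature online about these algorithms and their implementation.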
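As a rough illustration of how a distance-based method can tolerate gaps, the sketch below implements a tiny K-Nearest Neighbor classifier whose distance skips coordinates that are missing in either point and rescales the result. This is an assumption-laden toy, not how any particular library implements it; the function names `nan_euclidean` and `knn_predict` and all data values are made up.

```python
import numpy as np

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both vectors,
    rescaled to compensate for the skipped ones."""
    present = ~(np.isnan(a) | np.isnan(b))
    if not present.any():
        return np.inf  # no shared coordinates to compare on
    sq = np.sum((a[present] - b[present]) ** 2)
    return np.sqrt(sq * len(a) / present.sum())

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training rows nearest to x."""
    dists = np.array([nan_euclidean(row, x) for row in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical training data with gaps (values are made up).
X_train = np.array([
    [1.0, 1.0],
    [1.2, np.nan],
    [0.9, 1.1],
    [8.0, 8.0],
    [np.nan, 7.5],
    [8.2, 7.9],
])
y_train = np.array([0, 0, 0, 1, 1, 1])

# The query point is itself missing its second feature.
pred = knn_predict(X_train, y_train, np.array([1.1, np.nan]))
print(pred)
```

Rows that share no observed feature with the query get an infinite distance, so they are never chosen as neighbors unless nothing else is available.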

Conclusion

There are many ways to deal with missing data. Certain methods are better than others depending on the type of data and the amount that is missing. There are also more complicated ways to impute missing data that I have not covered here, but the options above are a great way to get started.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.
