Handling Missing Data in Python/Pandas
Posted by Daniel Gutierrez, ODSC, November 21, 2018
- It’s important to describe missing data and the challenges it poses.
- You need to clarify the confusing terminology that further adds to the field’s complexity.
- You should take the time to review methods for handling missing data.
- You need to learn how to apply robust multiple imputation methods to a varied data set in Python/Pandas.
Alexandru Agachi presented a tutorial workshop on handling missing data in Python/Pandas at ODSC Europe 2018. Agachi is a Co-founder of Empiric Capital, an algorithmic, data-driven asset management firm headquartered in London. He is also a guest lecturer in big data and machine learning at Pierre et Marie Curie University in Paris, and is involved in neuro-oncogenetic research, in particular in applications of machine learning. After initial studies at LSE, he completed four graduate and postgraduate degrees and diplomas in technology and science, focusing on the thorium nuclear fuel cycle, surgical robotics, neuroanatomy and imagery, and biomedical innovation.
[Related Article: Image Augmentation for Convolutional Neural Networks]
The tutorial was aimed at all data scientists and researchers trying to understand contemporary methodology for handling missing data in data sets. Curiously, the code presented in the talk was home-grown, as there is no viable Python/Pandas library available for handling missing data. A GitHub repo for the workshop is available.
What is Missing Data?
Missing data are “values that are not available and that would be meaningful for analysis if they were observed” (Little, 2012).
When you have missing data only for variables that are not of interest in your data science project, you can consider that you do not have missing data. The first step is hence to reduce your data set to the relevant variables. Only then should you start looking at missing data.
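This two-step order can be sketched in pandas. The data set and column names below are illustrative, not from the talk; the point is that missingness is quantified only after subsetting to the variables of interest.

```python
import numpy as np
import pandas as pd

# Toy data set; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 41, 35],
    "income": [52000, 48000, np.nan, 61000],
    "internal_id": ["a1", "a2", "a3", "a4"],  # not relevant to the analysis
})

# Step 1: reduce the data set to the variables of interest.
relevant = df[["age", "income"]]

# Step 2: only now quantify the missing data, per variable.
missing_counts = relevant.isnull().sum()
print(missing_counts)  # one missing value each for age and income
```

Had we counted missing values on the full frame, `internal_id` would have diluted the picture even though it plays no role in the analysis.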
Slide copyright Alexandru Agachi, ODSC Europe 2018
Missing Data: Recurring Theme
Many data scientists working with enterprise data sets are often surprised (and frustrated) by the state of the data received for projects. Corporate data is often quite dirty because of years of poor data governance. One recurring theme is missing data. It is quite common for some of the observations to be incomplete, in the sense that certain variables (features) are missing or are in error. A plan for dealing with missing data is needed. A brute-force method is to discard an entire observation if it is incomplete. A better method is to infer the missing values based on the data from other observations. A common approach is to fill the missing data with the average or the median of the other data values. This is called imputing data values.
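The mean/median fill described above is a one-liner in pandas with `fillna`. A minimal sketch, using made-up values:

```python
import numpy as np
import pandas as pd

# A variable with one missing value; numbers are made up for illustration.
s = pd.Series([2.0, np.nan, 4.0, 10.0])

# Single imputation: replace the missing value with a statistic
# computed from the observed values.
mean_imputed = s.fillna(s.mean())      # observed mean = 16/3, roughly 5.33
median_imputed = s.fillna(s.median())  # observed median = 4.0
```

Note that the two choices can differ substantially on skewed data, as here, which is one reason the median is often preferred when outliers are present.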
Missing data is a widespread challenge for data scientists. It is so common in practice that handling it is typically a prominent part of the project’s data pipeline, within the data wrangling step of the Data Science Process. It is not a question of if, but rather how much data will be missing at the end of performing exploratory data analysis (EDA). Knowing how to handle missing data, therefore, is a crucial skill for data scientists.
Missing Data Handling Methods
“The missing data research field emerged through several articles published in the 1970s (Dempster et al 1977, Heckman, 1979, Rubin, 1976),” reported Agachi in his talk. “And it really took off with Rubin, and Little and Rubin’s seminal texts, both published in 1987. Since then, Little and Rubin’s additional contributions, along with those of Schafer, Allison, Graham and van Buuren, have completed the theory underpinning this subfield of statistics. In parallel, software advances from SPSS to Stata to R and more recently Python, now allow researchers to implement robust methods for handling missing data in their studies.”
[Related Article: GPU Dask Arrays, First Steps Throwing Dask and CuPy Together]
The statistical community has led the way toward handling missing data, with several widely accepted statistical methods; however, researchers from other fields have been slow to adopt them. In many cases, the default solution is to delete all observations with missing data, and the most common advice in the data community is to perform single imputation, i.e., replacing all missing values with the mean or mode of that variable. These methods are suboptimal and lead to both increased bias and reduced power in the study results.
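One source of that bias is easy to demonstrate: filling every gap with the mean places the imputed points exactly at the center of the distribution, so the imputed variable understates the true spread. A small simulation sketch (synthetic data, not from the workshop):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A variable with unit variance; about 30% of values go missing
# completely at random.
true_values = pd.Series(rng.normal(loc=0.0, scale=1.0, size=1000))
observed = true_values.mask(rng.random(1000) < 0.3)

# Single (mean) imputation: every filled-in value sits exactly at the
# mean, contributing zero spread of its own.
imputed = observed.fillna(observed.mean())

# The imputed series' variance is mechanically smaller than that of the
# observed values -- downstream standard errors and tests inherit this bias.
print(observed.var(), imputed.var())
```

Multiple imputation methods, the focus of Agachi's workshop, address this by drawing several plausible values per gap and pooling the results, preserving the uncertainty that single imputation erases.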
To take a deeper dive into the importance of handling missing data in data science projects, check out Agachi’s full workshop talk from ODSC Europe 2018 below.