Good, Fast, Cheap: How to do Data Science with Missing Data

When doing any sort of data science problem, we will inevitably run into missing data.

Let’s say we’re interviewing 100 people and are recording their answers on a piece of paper in front of us. Specifically, one of our questions asks about income.

Consider a few examples of missing data:

  • Someone refuses to answer the question about income. Unbeknownst to us, this person’s income is low, and they do not feel comfortable sharing it.
  • Someone else declines to answer the question about income. Perhaps this person is younger and young people are simply less likely to respond to certain questions.
  • One of our subjects didn’t show up to the interview, so we haven’t observed any data for this person.
  • At the conclusion of all 100 interviews, we stand up and accidentally spill our coffee. The coffee blurs the top of the page, rendering the first three rows of our data unreadable.

We may think we’re safe if we gather data from a computer… but not quite. What if we gather information from a sensor counting the cars passing through a toll road every hour, and the sensor breaks? What if a computer is collecting temperature data, but the temperature drops below the minimum value that the computer can measure?

In a dataset, we’d see each of these missing values as something like an “NA.” However, these NAs were caused by very different things! As a result, the way we analyze data containing these missing values must be different.
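To see why that matters, here is a toy sketch in pandas (the data is made up for illustration): no matter which of the stories above produced the gap, the dataset records it with the same generic missing-value marker, so the *reason* for the missingness is invisible to the code.

```python
import numpy as np
import pandas as pd

# Hypothetical survey results: two respondents did not report income.
# Whether they refused, skipped the interview, or we spilled coffee,
# pandas stores the gap identically as NaN.
df = pd.DataFrame({
    "age": [34, 22, 45, 29],
    "income": [52000, np.nan, 78000, np.nan],
})

# Count the missing incomes -- the "why" is already gone
n_missing = df["income"].isna().sum()
print(n_missing)
```

The mechanism behind each NaN has to come from outside the dataset, which is exactly why the analysis strategy must depend on how the data went missing.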

So how do we do data science with missing data?

Well, as always: it depends.

To help us make a decision, we can use the “good, fast, cheap” diagram from project management:

Even if you haven’t seen it before, the idea is pretty straightforward:

  • You can do a project that is done fast and cheaply… but it won’t be good.
  • You can do a project that is good and is done cheaply… but it won’t be fast.
  • You can do a project that is good and is done fast… but it won’t be cheap.
  • It is basically impossible to have a project that can be done fast and cheaply and also be good.

[Related article: Handling Missing Data in Python/Pandas]


The same idea applies to how we handle missing data!

We can handle missing data by just dropping every observation that contains a missing value.

  • Our analysis is fast: In Python, it’s just one line of code!
  • Our analysis is cheap: We don’t need additional money to do this.
  • But it isn’t very good: By dropping all of our observations containing a missing value, we’re losing data and also making dangerous assumptions. Even somewhat more sophisticated techniques, like replacing missing data with the mean or the mode, can have dramatic, negative effects on our analysis.
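The "one line of code" from the first bullet, plus the mean-fill variant, sketched in pandas with made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing incomes
df = pd.DataFrame({
    "age": [34, 22, 45, 29],
    "income": [52000, np.nan, 78000, np.nan],
})

# Fast and cheap: drop every row containing a missing value -- one line.
# We silently lose half the sample here.
complete_cases = df.dropna()

# The "more sophisticated" variant: fill missing incomes with the column
# mean -- also one line, but it shrinks the variance of income and
# distorts its relationship with other columns.
mean_filled = df.fillna({"income": df["income"].mean()})
```

Both lines run instantly and cost nothing, which is precisely the trap: the speed hides the assumptions being made about why the values are missing.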


We can handle missing data by trying to avoid missing data up front.

  • Our analysis is fast: When it comes time to analyze our data, we don’t have to do anything special because our data is already complete. This is effectively zero lines of code!
  • Our analysis is good: We don’t have any uncertainty in our results if we truly collected 100% of the intended data.
  • But it isn’t very cheap: Spending money to collect all of our intended data can be very, very expensive.


We can handle missing data by using sophisticated techniques such as the pattern submodel approach or multiple imputation.

  • Our analysis is cheap: We don’t need to spend any additional money!
  • Our analysis is good: We are properly estimating the uncertainty in our results or are forgoing imputation techniques altogether.
  • But it isn’t very fast: Our analysis will be more involved and will likely take substantially longer.

[Related article: From Pandas to Scikit-Learn — A New Exciting Workflow]


Which approach is right for you? Well… it depends! How much time do you have to do your analysis? How much money do you have? What are the trade-offs comparing quality, time, and money?

Interested in hearing more? I’m looking forward to sharing more about missing data at the Open Data Science Conference in Boston on Tuesday, April 30 at 9:00 a.m.

Matt Brems

Matt is currently a global instructor for General Assembly's Data Science Immersive program across the United States. With General Assembly, he also serves as the chair of their Data Science Product Advisory Board and has been selected to be a member of their "Distinguished Faculty" program. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Prior to his work in politics, he earned his Master's degree in statistics from The Ohio State University and his undergraduate degree at Franklin College in Indiana. Matt is passionate about putting the revolutionary power of machine learning into the hands of as many people as possible. When he's not teaching, he works as a Managing Partner at ROC AUC, LLC, volunteers with Statistics Without Borders, and falls asleep to Netflix. Connect here: http://www.linkedin.com/in/matthewbrems