When doing any sort of data science problem, we will inevitably run into missing data.
Let’s say we’re interviewing 100 people and are recording their answers on a piece of paper in front of us. Specifically, one of our questions asks about income.
- Someone refuses to answer the question about income. Unbeknownst to us, this person’s income is low, and they do not feel comfortable sharing it.
- Someone else declines to answer the question about income. Perhaps this person is younger and young people are simply less likely to respond to certain questions.
- One of our subjects didn’t show up to the interview, so we haven’t observed any data for this person.
- At the conclusion of all 100 interviews, we stand up and accidentally spill our coffee. The coffee blurs the top of the page, rendering the first three rows of our data unreadable.
We may think we’re safe if we gather data from a computer… but not quite. What if we gather information from a sensor counting the cars passing through a toll road every hour, and the sensor breaks? What if a computer is collecting temperature data, but the temperature drops below the minimum value that the computer can measure?
In a dataset, we’d see each of these missing values as something like an “NA.” However, these NAs were caused by very different things! As a result, the way we analyze data containing these missing values must be different.
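In pandas, for example, every one of these causes collapses into the same NaN marker. A minimal sketch (the column and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical income column: refusal, skipped question, answered, no-show
# all end up looking identical in the data
survey = pd.DataFrame({
    "income": [np.nan, np.nan, 72000, np.nan],
})

# The data alone cannot tell these mechanisms apart
print(survey["income"].isna().sum())  # 3 missing values, three different causes
```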
So how do we do data science with missing data?
Well, as always: it depends.
To help us make a decision, we can use the “good, fast, cheap” diagram from project management:
- You can do a project that is done fast and cheaply… but it won’t be good.
- You can do a project that is good and is done cheaply… but it won’t be fast.
- You can do a project that is good and is done fast… but it won’t be cheap.
- It is basically impossible to have a project that is good, fast, and cheap all at once.
The same idea applies to how we handle missing data!
We can handle missing data by just dropping every observation that contains a missing value.
- Our analysis is fast: In Python, it’s just one line of code!
- Our analysis is cheap: We don’t need additional money to do this.
- But it isn’t very good: By dropping every observation containing a missing value, we lose data and make dangerous assumptions about why those values are missing. Even slightly more sophisticated techniques, like replacing missing values with the mean or the mode, can dramatically bias our results.
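The one-line drop described above might look like this in pandas (the columns and values here are illustrative, not the author’s data):

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks refused or unreadable answers
df = pd.DataFrame({
    "age":    [34,    21,     np.nan, 45],
    "income": [52000, np.nan, 61000,  np.nan],
})

# Listwise deletion: drop every row containing at least one missing value
complete_cases = df.dropna()
print(complete_cases)  # only the first row survives
```

Fast and cheap indeed, but three of four respondents vanish from the analysis.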
We can handle missing data by trying to avoid missing data up front.
- Our analysis is fast: When it comes time to analyze our data, we don’t have to do anything special because our data is already complete. This is effectively zero lines of code!
- Our analysis is good: We don’t have any uncertainty in our results if we truly collected 100% of the intended data.
- But it isn’t very cheap: Spending money to collect all of our intended data can be very, very expensive.
We can handle missing data by using sophisticated techniques such as the pattern submodel approach or multiple imputation.
- Our analysis is cheap: We don’t need to spend any additional money!
- Our analysis is good: We are properly estimating the uncertainty in our results or are forgoing imputation techniques altogether.
- But it isn’t very fast: Our analysis will be more involved and will likely take substantially longer.
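To give a flavor of why this route takes longer, here is a minimal multiple-imputation sketch with plain NumPy: fill each gap by drawing from the observed values, repeat m times, then pool the estimates with Rubin’s rules. The income numbers are invented, and real applications would use a proper imputation model rather than simple random draws:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical income responses (in thousands); NaN marks refusals
income = np.array([42.0, np.nan, 55.0, 61.0, np.nan, 48.0, 50.0, np.nan])
observed = income[~np.isnan(income)]
n_missing = int(np.isnan(income).sum())

m = 20  # number of imputed datasets
estimates, within_vars = [], []
for _ in range(m):
    filled = income.copy()
    # Impute by sampling from the observed values ("hot deck" style draw)
    filled[np.isnan(filled)] = rng.choice(observed, size=n_missing)
    estimates.append(filled.mean())
    # Within-imputation variance of the mean for this completed dataset
    within_vars.append(filled.var(ddof=1) / len(filled))

# Rubin's rules: pool the estimates and combine both sources of variance
pooled = float(np.mean(estimates))
total_var = float(np.mean(within_vars) + (1 + 1 / m) * np.var(estimates, ddof=1))
print(pooled, total_var)
```

The extra variance term, scaled by (1 + 1/m), is the between-imputation spread: it is precisely the honest accounting of uncertainty that single-value fills like the mean quietly throw away.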
Which approach is right for you? Well… it depends! How much time do you have to do your analysis? How much money do you have? What are the trade-offs comparing quality, time, and money?
Interested in hearing more? I’m looking forward to sharing more about missing data at the Open Data Science Conference in Boston on Tuesday, April 30 at 9:00 a.m.