Having clean, comprehensive, and consistent data is paramount to developing effective machine learning algorithms. Incomplete data exposes you to bias and skewed results that can lead to improper decision-making. When it comes to missing data, there are three major types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), and each has its own causes and remedies. Let’s learn more about the three types of missing data and how you can address the issue.
Missing Completely at Random (MCAR)
Data are MCAR when the probability of a value being missing is independent of both the observed and the unobserved data. Essentially, MCAR often comes down to bad luck, such as damage to a physical device, lost surveys, or simple glitches. Example: you’re analyzing surveys of household incomes, but some of the surveys went missing. Because the remaining data are still representative, MCAR is the easiest case to handle: there’s enough information in the data that exist to fill in the gaps, much as we might infer the incomes on one street from the data on the next street.
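To see why MCAR is the benign case, here is a minimal sketch (the income distribution and missingness rate are illustrative assumptions, not data from the article): each survey is lost with the same probability regardless of any value, so the surviving responses remain representative and their mean stays close to the true mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative household incomes (assumed distribution)
income = rng.normal(60_000, 15_000, size=10_000)

# MCAR: every survey has the same 20% chance of being lost,
# unrelated to any observed or unobserved value
mcar_mask = rng.random(income.size) < 0.2
observed = income[~mcar_mask]

# Under MCAR the observed mean is an unbiased estimate
# of the true mean -- we only lose precision, not validity
print(round(income.mean()), round(observed.mean()))
```

Dropping the missing rows (complete-case analysis) or filling them from the observed distribution both work here precisely because the missingness carries no information.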
Missing at Random (MAR)
When data are MAR, values are missing systematically, and the missingness is related to the observed data but not to the unobserved data. MAR is more granular than MCAR: the data are missing within a certain variable, at rates that depend on other variables you did observe. In the household income example, MAR would mean the income question goes unanswered more often among respondents in a certain profession, rather than entire surveys vanishing at random. Ignoring this pattern creates a bias toward or against those professions: a complete-case analysis would, in effect, estimate a manager’s salary from entry-level salaries. The good news is that because the cause of the missingness (here, profession) is observed, MAR data can be handled: the missing incomes can be modeled from the observed variables rather than assumed away.
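A minimal sketch of the MAR case (job titles, income distributions, and nonresponse rates are all illustrative assumptions): managers skip the income question more often, so the naive mean over respondents is biased, but because job title is observed we can impute within each group and largely recover the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Observed variable: True = manager, False = entry-level (illustrative)
manager = rng.random(n) < 0.3
income = np.where(manager,
                  rng.normal(90_000, 10_000, n),
                  rng.normal(50_000, 8_000, n))

# MAR: nonresponse depends only on the OBSERVED job title --
# managers skip the income question 40% of the time, others 5%
p_miss = np.where(manager, 0.4, 0.05)
missing = rng.random(n) < p_miss

# Naive mean over respondents is biased low: the missing
# rows are disproportionately the high-earning managers
naive = income[~missing].mean()

# Group-mean imputation using the observed job title
imputed = income.copy()
for grp in (True, False):
    sel = manager == grp
    grp_mean = income[sel & ~missing].mean()
    imputed[sel & missing] = grp_mean

print(round(income.mean()), round(naive), round(imputed.mean()))
```

Group-mean imputation is the simplest fix; in practice, regression or multiple imputation conditions on the observed variables the same way while also preserving variance.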
Missing Not at Random (MNAR)
MNAR data are missing for a reason tied to the missing values themselves, aka it could be deliberate one way or another. In the same example, this would mean the people who listed Manager as a job title deliberately skipped the income question because of what their answer would reveal. Unlike MAR, no observed variable fully explains the missingness, so we can’t recover the missing values from the data alone; we may suspect why the data are missing, but we can’t reconstruct them without extra assumptions.
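A minimal sketch of the MNAR case (again with an assumed income distribution and nonresponse rule): here the probability of skipping the question depends on the income itself, so the observed mean is biased low, and no observed column exists that could correct it.

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.normal(60_000, 15_000, size=10_000)

# MNAR: nonresponse depends on the UNOBSERVED value itself --
# respondents earning over $80k skip the question 80% of the time
missing = rng.random(income.size) < (income > 80_000) * 0.8

observed = income[~missing]

# The observed mean is biased low, and because the cause of the
# missingness is the missing value itself, nothing in the observed
# data can correct it without outside assumptions
print(round(income.mean()), round(observed.mean()))
```

This is why MNAR is the hardest case: methods such as the pattern submodel approach mentioned below work by making the assumptions about the missingness mechanism explicit rather than by estimating it from the data.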
How to work with these types of missing data
Missing data isn’t a death sentence for your algorithms, and handling it doesn’t have to be overly difficult, expensive, or time-consuming. With the upcoming Ai+ Training session, “How to do Data Science with Missing Data,” on September 21st with Matt Brems, Distinguished Faculty at General Assembly, you’ll learn how to work with all three types of missing data. In this 4-hour immersive session, you’ll be able to do the following:
- Describe the impact of missing data using simulations, identify techniques for avoiding missing data, and give specific examples of how to avoid it.
- Define unit and item missingness and identify when they occur; implement weight class adjustments and identify the advantages and disadvantages of this technique.
- Define and give examples of data that are missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR), and describe a workflow for doing data science with missing data.
- Describe proper regression imputation and the pattern submodel method, and select the best missing data technique given your situation and real-world constraints.
Don’t wait too long! This session is currently 30% off, but not for long. Register here.