The Un-Sexy Data Science: Data Cleaning

The Data Science profession, while often misunderstood (even by the field itself; possibly a topic for another post), was famously labeled the “sexiest job of the 21st century” by the Harvard Business Review in October 2012. Since then, businesses, education providers, and potential talent have all been falling over themselves to cash in on the gravy train. I mean, with terms like Artificial Intelligence (AI), Machine Learning (ML), Predictive Modelling, and Neural Networks flying around, who wouldn’t get excited, right?! What the glossy brochures won’t tell you is that the fabled Data Scientist will spend hours upon hours doing the very unsexy task of… wait for it… data cleaning.

Data cleaning doesn’t involve a literal wash ‘n vac; it’s the mundane process of tidying up data that has either been identified as wrong (“dirty data”) or is missing altogether. Rather than moan about data cleaning and try to convince you with words alone, we’ll work through an example from the NFL Big Data Bowl currently being run on Kaggle. The competition challenges entrants to build a model that best predicts “How many yards will an NFL player gain after receiving a handoff?”


The Data

The data that we will use for data cleaning was provided by Next Gen Stats and contains tracking data for running plays. Each row in the dataset corresponds to a single player’s involvement in the play, so one play will have 22 records (11 players per team). There are 509,762 rows and 49 variables (columns), ranging from each player’s position, speed, acceleration, height, and weight, to stadium details, weather details, formations, and game details.

The Data Science Workflow

Once the Data Scientist has collected the data (this example isn’t all that representative, as we were handed the data), the first step would usually be to perform what’s known as EDA — Exploratory Data Analysis — to uncover patterns, relationships, and quirks in the data. To perform the exploration and cleaning, packages from the tidyverse ecosystem will be used extensively, including:

  • dplyr: data wrangling
  • stringr: handling strings (text)
  • ggplot2, scales: data visualization
  • lubridate: dealing with date-based variables

Categorical Variables

In this post, we will focus our attention on cleaning up the categorical variables in the dataset. Dealing with missing values is another key piece of work for any Data Scientist, but we will park that for now until a later post.

To select only categorical variables, the select_if() function makes things really easy. To inspect what we have, glimpse() provides an arguably more appealing output than str().

# to select just the categorical variables:
train_cat <- train %>% 
  select_if(is.character)

# quickly inspect the data
glimpse(train_cat)
This post will limit itself to cleaning just three of the categorical variables: PlayerHeight, Stadium, and WindSpeed. Going through all 49 variables in one blog post may be a bit much to stomach for you, the reader!

The first variable we’ll clean is PlayerHeight. As a feature in a predictive model, height as a categorical variable would increase complexity (one-hot encoding would create a column per level, etc.).

There are 16 different heights, and as strings they don’t sort correctly either, i.e. “5-10” sorts before “5-6”…

As a result, we can convert the height currently displayed as feet-inches to inches, and then plot the continuous variable in a histogram:

# function to convert feet-inches (e.g. "6-2") to total inches
convert_to_inches <- function(x) {
  (as.numeric(str_extract(x, "[^-]+")) * 12) + as.numeric(str_extract(x, "[^-]*$"))
}

# apply the function using mapply()
train_cat <- train_cat %>% 
  mutate(PlayerHeight = mapply(convert_to_inches, PlayerHeight))
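Since str_extract() is already vectorised, the same conversion can be sanity-checked (and even performed) without mapply(). A minimal, standalone sketch on a few sample heights:

```r
library(stringr)

heights <- c("5-10", "6-2", "5-6")
feet   <- as.numeric(str_extract(heights, "[^-]+"))   # digits before the "-"
inches <- as.numeric(str_extract(heights, "[^-]*$"))  # digits after the "-"
height_in <- feet * 12 + inches
height_in  # 70 74 66
```

A 5'10" player comes out at 70 inches, which is exactly what we want.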

Running length(unique(train_cat$Stadium)) reveals that there are 55 different stadiums… which seems quite high given there are only 32 teams in the NFL. Let’s plot to see which stadiums have hosted games:
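The plot itself is a count piped into a bar chart. A sketch of the approach, shown on a toy tibble so the snippet runs standalone (on the real data you would count train_cat instead):

```r
library(dplyr)
library(ggplot2)

# toy stand-in for train_cat; the real data has 55 distinct spellings
stadiums <- tibble(Stadium = c("FirstEnergy", "FirstEnergy Stadium",
                               "FirstEnergyStadium", "MetLife Stadium",
                               "MetLife Stadium"))

stadium_plot <- stadiums %>% 
  count(Stadium, sort = TRUE) %>% 
  ggplot(aes(x = reorder(Stadium, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Rows in dataset")

stadium_plot
```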

Ok, I see what’s happened here. During data collection, some of the stadium names must have been entered differently or collected from different sources. For example, FirstEnergy Stadium appears under three different names: “FirstEnergyStadium”, “FirstEnergy Stadium” and “FirstEnergy”. The function below will clean those up so there is only one legitimate entry per stadium.

# function to clean stadium names
clean_stadiums <- function(stadium) {
  if(stadium == "Broncos Stadium At Mile High") {
    "Broncos Stadium at Mile High"
  } else if(stadium == "CenturyField" | stadium == "CenturyLink") {
    "CenturyLink Field"
  } else if(stadium == "EverBank Field") {
    "Everbank Field"
  } else if(stadium == "First Energy Stadium" | stadium == "FirstEnergy" | stadium == "FirstEnergyStadium") {
    "FirstEnergy Stadium"
  } else if(stadium == "Lambeau field") {
    "Lambeau Field"
  } else if(stadium == "Los Angeles Memorial Coliesum") {
    "Los Angeles Memorial Coliseum"
  } else if(stadium == "M & T Bank Stadium" | stadium == "M&T Stadium") {
    "M&T Bank Stadium"
  } else if(stadium == "Mercedes-Benz Dome") {
    "Mercedes-Benz Superdome"
  } else if(stadium == "MetLife" | stadium == "Metlife Stadium"){
    "MetLife Stadium"
  } else if(stadium == "NRG") {
    "NRG Stadium"
  } else if(stadium == "Oakland Alameda-County Coliseum") {
    "Oakland-Alameda County Coliseum"
  } else if(stadium == "Paul Brown Stdium") {
    "Paul Brown Stadium"
  } else if(stadium == "Twickenham") {
    "Twickenham Stadium"
  } else {
    # otherwise the name is already clean
    stadium
  }
}

# apply function
train_cat <- train_cat %>% 
  mutate(Stadium = mapply(clean_stadiums, Stadium))

As a result of the change, we are now left with 37 unique stadiums — much better.
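As a design note, a long if/else chain grows unwieldy as fixes accumulate. dplyr’s recode() expresses the same mapping as data and leaves unmatched names untouched; a standalone sketch using a few of the fixes above:

```r
library(dplyr)

stadiums <- c("FirstEnergy", "MetLife Stadium", "FirstEnergyStadium")
cleaned <- recode(stadiums,
                  "FirstEnergy"        = "FirstEnergy Stadium",
                  "FirstEnergyStadium" = "FirstEnergy Stadium",
                  "MetLife"            = "MetLife Stadium")
cleaned  # all three map to their canonical names
```

Because recode() is vectorised, it also removes the need for mapply().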

Strangely, WindSpeed appears as a categorical variable… let’s see why this is.

[Figure: the most frequent WindSpeed values. The last three entries should be for “WindDirection”…]

Ah ha! We have some records that explicitly state “MPH” and even “gusts up to”. Additionally, it appears as though some WindDirection values have found their way into the WindSpeed variable (“E”, “SE”, “SSW”). Given that the most frequently occurring values are plain integers, we will go ahead and clean the variable up so it contains integers only.

The first step was to handle the values that looked like they belong to the WindDirection variable. When isolating those records, it was found that the data collection process had, in fact, switched these two values around. This was reversed in the first mutate() call.

Then, the value “Calm” was replaced with 0 — you could assume that a calm wind speed would be near zero.

train_cat <- train_cat %>% 
  # swap back the values that were switched with WindDirection
  mutate(WindSpeedClean = ifelse(WindSpeed %in% c("E", "SE", "SSW"), WindDirection, WindSpeed)) %>% 
  # replace "Calm" with 0 and standardise the case
  mutate(WindSpeedClean = str_replace(WindSpeedClean, "Calm", "0"),
         WindSpeedClean = tolower(WindSpeedClean))

Then we remove all variations of “mph” (already lower-cased by the step above), along with any additional whitespace (leading or trailing spaces).

Finally, where there was a range, the lower end was retained. For example, for the values with a wind speed of 10–20, the 10 was retained as the value, as was the “15” in “15 gusts up to 25”.

# remove "mph" and additional whitespace
train_cat$WindSpeedClean <- gsub("mph", "", train_cat$WindSpeedClean) %>% str_squish()
# take first number before the "-"
train_cat$WindSpeedClean <- str_extract(train_cat$WindSpeedClean, "[^-]+")
# take the first number before the first space
train_cat$WindSpeedClean <- str_extract(train_cat$WindSpeedClean, "[^ ]+")
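To see those steps in action, here they are applied to a few representative raw values (a standalone sketch; note the values are already lower-cased at this point in the pipeline):

```r
library(stringr)

raw <- c("10-20", "15 gusts up to 25", "calm", "12mph")
x <- str_replace(raw, "calm", "0")     # calm -> 0
x <- str_squish(gsub("mph", "", x))    # drop "mph", tidy whitespace
x <- str_extract(x, "[^-]+")           # keep what's before a "-"
x <- str_extract(x, "[^ ]+")           # keep what's before a space
wind_clean <- as.integer(x)
wind_clean  # 10 15 0 12
```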

Then we plot the data to make sure we’ve cleaned up the erroneous entries:
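The check can be as simple as a bar chart of the cleaned variable. A sketch using illustrative values in place of train_cat$WindSpeedClean:

```r
library(ggplot2)

# illustrative cleaned wind speeds; in the post this is train_cat$WindSpeedClean
wind <- data.frame(WindSpeedClean = c(0L, 5L, 10L, 10L, 12L, 15L, 15L, 20L))

wind_plot <- ggplot(wind, aes(x = factor(WindSpeedClean))) +
  geom_bar() +
  labs(x = "Wind speed (mph)", y = "Count")

wind_plot
```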

[Figure: WindSpeedClean values after cleaning, now all integers]


If you’ve reached this point and are thinking to yourself “gee, that didn’t seem too bad, what’s he complaining about,” just know that the paltry three variables covered here were three of forty-nine. That’s not to mention dealing with missing values, or feature engineering and selection, all BEFORE we can even start applying the sexy techniques that seem to draw all the attention.

Jason Zivkovic

I am a data scientist working for a transformative worldwide distributor of plumbing, waterworks, and HVAC-R products. My passion lies in telling stories through data and focusing on data visualization to bring those stories to life. In my spare time, I love performing analyses on various sporting questions that pop in my head. You can find these at my blog www.dontblamethedata.com.