

Predicting Airline Delays
ModelingPredictive Analyticsposted by Jesse Steinweg-Woods December 17, 2017 Jesse Steinweg-Woods

I don’t know about all of you, but flying doesn’t always go smoothly for me. I have had some horror stories I could tell you about weird delays I have encountered while flying. Wouldn’t it be nice to know how much your flight will probably be delayed and why?
Well, that’s what this project will attempt to do. Granted, the data scientists over at Hortonworks did a very similar project (and a well done one in my opinion!) just a few months ago. My project will be a little different from theirs in that instead of doing a classification problem (yes/no for a delayed flight), this will be a regression problem where I will try to predict the delay time in number of minutes (which can be negative). The regression model will not be restricted to a single city, so we are going to be working with a very large number of training examples!
To complete this project, we need some data about flights. Fortunately, the government keeps such a resource available that we are going to examine in this project.
Similar to the project about faculty salaries, this post will be split into two major parts: exploratory data analysis and feature engineering in R, with regression model implementation in Python.
Getting the Data
For this project, the best place to get data about airlines is from the US Department of Transportation, here. There are several options available for what data you can choose and which features. For this project, I chose the following features:
- Year
- Month
- DayofMonth
- DayofWeek
- UniqueCarrier
- OriginAirportID
- DestAirportID
- CRSDepTime (the local time the plane was scheduled to depart)
- CRSArrTime (the local time the plane was scheduled to arrive)
- ArrDelay (how early/late the plane was at its final destination in minutes: our target variable)
- Distance (how far did the plane travel for the route)
- CRSElapsedTime (the scheduled difference between departure and arrival)
I only wished to include features that a user could enter at any time. In theory, you could predict your flight delay for 6 months from now with this model. Because of that, I can’t include any time dependent features (such as, sadly for me, weather, which could have helped with this model’s accuracy). The Hortonworks example included weather data as an interesting augmentation to the model. However, for the way I want to set up this model, including weather data would be an example of data leakage.
As someone who has studied the weather for a very long time, trust me when I say the furthest out you can predict the weather at a specific location with any sort of accuracy is about a week. We can’t include daily max wind speeds, snowfall totals, etc. because there is no way of knowing that before you fly and the day has already ended!
Unfortunately, the way the Department of Transportation processes the downloading, you have to select which features you want and download them, but only for one month at a time (which is tedious and takes a while). Because of this, I have collated the data for you in one large file available on AWS (Amazon Web Services) S3 so you don’t have to deal with any of this! The data I chose is for a full year extending from November 2013 to October 2014 (when I did the programming for this project, not all of 2014 was available yet). I have also included a lookup table for the Airports and Carriers that will become useful later.
Now that we understand the basics of the data, let’s get it loaded into R (warning: the collated file is about 350 MB so downloading it may take a little while if you are following along!)
%load_ext rpy2.ipython
%%R
airportsDB <- read.csv('http://s3.amazonaws.com/jsw.dsprojects/AirlinePredictions/Airport_Lookup.csv',
header = TRUE, stringsAsFactors = FALSE)
carriersDB <- read.csv('http://s3.amazonaws.com/jsw.dsprojects/AirlinePredictions/Carrier_Lookup.csv',
header = TRUE, stringsAsFactors = FALSE)
flightsDB <- read.csv('http://s3.amazonaws.com/jsw.dsprojects/AirlinePredictions/CombinedFlights.csv',
header = TRUE, stringsAsFactors = FALSE)
Let’s take a look at what our flightsDB dataframe contains to make sure there weren’t any issues.
%%R
head(flightsDB)
X.1 YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK UNIQUE_CARRIER ORIGIN_AIRPORT_ID
1 1 2014 1 1 3 AA 12478
2 2 2014 1 2 4 AA 12478
3 3 2014 1 3 5 AA 12478
4 4 2014 1 4 6 AA 12478
5 5 2014 1 5 7 AA 12478
6 6 2014 1 6 1 AA 12478
DEST_AIRPORT_ID CRS_DEP_TIME CRS_ARR_TIME ARR_DELAY CRS_ELAPSED_TIME DISTANCE
1 12892 900 1225 13 385 2475
2 12892 900 1225 1 385 2475
3 12892 900 1225 NA 385 2475
4 12892 900 1225 59 385 2475
5 12892 900 1225 110 385 2475
6 12892 900 1225 -8 385 2475
X
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
Looks like we can get rid of the first and last columns. We also don’t need the YEAR column either.
%%R
flightsDB <- subset(flightsDB, select = -c(X, YEAR, X.1))
It also appeared as if some of the ARR_DELAY values are NA. We can’t predict on these, and we have plenty of training examples anyway, so let’s drop the rows where our target variable is missing. How many are missing?
%%R
summary(flightsDB$ARR_DELAY)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-112.00 -11.00 -3.00 7.48 11.00 1983.00 150482
Well 150,000 missing rows does seem like a lot. How many flights are we dealing with?
%%R
dim(flightsDB)
[1] 5900609 11
We have 5.9 MILLION rows! Compared to that magnitude, 150,000 missing rows is a small percentage. There isn’t much we can do with these missing values, so we need to drop them. Let’s clean them out.
%%R
flightsDB <- na.omit(flightsDB)
Checking again just to make sure no other features have NA values . . .
%%R
summary(flightsDB)
MONTH DAY_OF_MONTH DAY_OF_WEEK UNIQUE_CARRIER
Min. : 1.000 Min. : 1.00 Min. :1.000 Length:5750127
1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000 Class :character
Median : 7.000 Median :16.00 Median :4.000 Mode :character
Mean : 6.624 Mean :15.74 Mean :3.938
3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:6.000
Max. :12.000 Max. :31.00 Max. :7.000
ORIGIN_AIRPORT_ID DEST_AIRPORT_ID CRS_DEP_TIME CRS_ARR_TIME
Min. :10135 Min. :10135 Min. : 1 Min. : 1
1st Qu.:11292 1st Qu.:11292 1st Qu.: 925 1st Qu.:1119
Median :12889 Median :12889 Median :1320 Median :1520
Mean :12671 Mean :12672 Mean :1325 Mean :1499
3rd Qu.:13931 3rd Qu.:13931 3rd Qu.:1720 3rd Qu.:1910
Max. :16218 Max. :16218 Max. :2359 Max. :2359
ARR_DELAY CRS_ELAPSED_TIME DISTANCE
Min. :-112.000 Min. : 15.0 Min. : 31
1st Qu.: -11.000 1st Qu.: 84.0 1st Qu.: 361
Median : -3.000 Median :116.0 Median : 626
Mean : 7.484 Mean :136.7 Mean : 797
3rd Qu.: 11.000 3rd Qu.:167.0 3rd Qu.:1024
Max. :1983.000 Max. :675.0 Max. :4983
Looks like we are okay for missing values. Now that we have loaded our data and cleaned it a little bit, let’s do some feature engineering.
Creating a New Feature: Number of Days to Holiday
This feature was actually included in the Hortonworks example I linked to previously, and I thought it was a good idea. It makes intuitive sense that airlines are probably going to be under stress near holidays, so this feature could probably be a decent predictor of a late flight.
First, we need all of the exact dates for holidays inside our range of flights (November 2013-October 2014). I had to look several of them up because they change every year (like Thanksgiving for example). Let’s create a list of holiday dates and convert them to the DATE type in R.
%%R
holidays <- c('2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
'2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
'2013-11-28', '2013-12-25') # Ten major holidays, including Memorial Day, Columbus Day, Labor Day, MLK Day
# Veteran's Day, New Year's, President's Day, Independence Day, Thanksgiving,
# and Christmas Day.
holidayDates <- as.Date(holidays)
To make things easier, let’s create a function to calculate the difference between each of our flight dates and the nearest holiday. We got rid of the YEAR feature earlier as we don’t need it for prediction or analysis, but we will need to include a provision for it in our function to find the exact number of days between any given date and a holiday.
%%R
DaysToHoliday <- function(month, day){ # Input a month and day from the flightsDB
# Get our year.
year <- 2014
if (month > 10){
year <- 2013
}
# Paste on a 2013 for November and December dates.
currDate <- as.Date(paste(year,month,day,sep = '-')) # Create a DATE object we can use to calculate the time difference
numDays <- as.numeric(min(abs(currDate-holidayDates))) # Now find the minimum difference between the date and our holidays
return(numDays) # We can vectorize this to automatically find the minimum closest
# holiday by subtracting all holidays at once
}
We COULD run this function on every one of our 5.9 million rows, but that would probably be quite slow. If were working on a cluster like in the Hortonworks example, you could probably get away with using Pig like they did to do it.
What would be a much faster and more creative way to solve this problem (since we are not using HDFS or a cluster here) is to create a lookup table that we can access instead. There are only 365 different days in the year, so we actually only need to run the function 365 times instead of 5.9 million. Then, save the results in a table we use later. Much better idea! This trick is often used in programming, and it is called hashing to speed up a program if a lot of the computation is redundant.
So, let’s create our own hash table to speed things up by quite a bit.
%%R
datesOfYear <- unique(flightsDB[,1:2]) # Get all of the dates through unique Month/Day combinations
datesOfYear$HDAYS <- mapply(DaysToHoliday, datesOfYear$MONTH, datesOfYear$DAY_OF_MONTH)
# Apply our function in a vectorized manner via one of R's many "apply" functions (in this case mapply)
# to each unique date and save
Let’s take a look at our hash table to see that it worked.
%%R
head(datesOfYear)
MONTH DAY_OF_MONTH HDAYS
1 1 1 0
2 1 2 1
4 1 4 3
5 1 5 4
6 1 6 5
7 1 7 6
Similar to before, we want to create a small function that we can then vectorize using mapply. Let’s create a function to read from our hash table.
%%R
InputDays <- function(month,day){
finalDays <- datesOfYear$HDAYS[datesOfYear$MONTH == month & datesOfYear$DAY_OF_MONTH == day] # Find which row to get
return(finalDays)
}
Now let’s run this much simpler lookup on all of our rows and generate the feature. If you are using this interactively, it will probably take 5-10 minutes, depending on your computer performance.
%%R
flightsDB$HDAYS <- mapply(InputDays, flightsDB$MONTH, flightsDB$DAY_OF_MONTH)
Let’s check our new feature quickly.
%%R
head(flightsDB)
MONTH DAY_OF_MONTH DAY_OF_WEEK UNIQUE_CARRIER ORIGIN_AIRPORT_ID
1 1 1 3 AA 12478
2 1 2 4 AA 12478
4 1 4 6 AA 12478
5 1 5 7 AA 12478
6 1 6 1 AA 12478
7 1 7 2 AA 12478
DEST_AIRPORT_ID CRS_DEP_TIME CRS_ARR_TIME ARR_DELAY CRS_ELAPSED_TIME DISTANCE
<sp