Predicting Airline Delays Predicting Airline Delays
I don’t know about all of you, but flying doesn’t always go smoothly for me. I have had some horror stories... Predicting Airline Delays

I don’t know about all of you, but flying doesn’t always go smoothly for me. I have had some horror stories I could tell you about weird delays I have encountered while flying. Wouldn’t it be nice to know how much your flight will probably be delayed and why?

Well, that’s what this project will attempt to do. Granted, the data scientists over at Hortonworks did a very similar project (and a well done one in my opinion!) just a few months ago. My project will be a little different from theirs in that instead of doing a classification problem (yes/no for a delayed flight), this will be a regression problem where I will try to predict the delay time in number of minutes (which can be negative). The regression model will not be restricted to a single city, so we are going to be working with a very large number of training examples!

To complete this project, we need some data about flights. Fortunately, the government keeps such a resource available that we are going to examine in this project.

Similar to the project about faculty salaries, this post will be split into two major parts: exploratory data analysis and feature engineering in R, with regression model implementation in Python.

Getting the Data

For this project, the best place to get data about airlines is from the US Department of Transportation, here. There are several options available for what data you can choose and which features. For this project, I chose the following features:

  • Year
  • Month
  • DayofMonth
  • DayofWeek
  • UniqueCarrier
  • OriginAirportID
  • DestAirportID
  • CRSDepTime (the local time the plane was scheduled to depart)
  • CRSArrTime (the local time the plane was scheduled to arrive)
  • ArrDelay (how early/late the plane was at its final destination in minutes: our target variable)
  • Distance (how far did the plane travel for the route)
  • CRSElapsedTime (the scheduled difference between departure and arrival)

I only wished to include features that a user could enter at any time. In theory, you could predict your flight delay for 6 months from now with this model. Because of that, I can’t include any time dependent features (such as, sadly for me, weather, which could have helped with this model’s accuracy). The Hortonworks example included weather data as an interesting augmentation to the model. However, for the way I want to set up this model, including weather data would be an example of data leakage.

As someone who has studied the weather for a very long time, trust me when I say the furthest out you can predict the weather at a specific location with any sort of accuracy is about a week. We can’t include daily max wind speeds, snowfall totals, etc. because there is no way of knowing that before you fly and the day has already ended!

Unfortunately, the way the Department of Transportation processes the downloading, you have to select which features you want and download them, but only for one month at a time (which is tedious and takes a while). Because of this, I have collated the data for you in one large file available on AWS (Amazon Web Services) S3 so you don’t have to deal with any of this! The data I chose is for a full year extending from November 2013 to October 2014 (when I did the programming for this project, not all of 2014 was available yet). I have also included a lookup table for the Airports and Carriers that will become useful later.

Now that we understand the basics of the data, let’s get it loaded into R (warning: the collated file is about 350 MB so downloading it may take a little while if you are following along!)

%load_ext rpy2.ipython
airportsDB <- read.csv('http://s3.amazonaws.com/jsw.dsprojects/AirlinePredictions/Airport_Lookup.csv',
                header = TRUE, stringsAsFactors = FALSE)
carriersDB <- read.csv('http://s3.amazonaws.com/jsw.dsprojects/AirlinePredictions/Carrier_Lookup.csv',
                        header = TRUE, stringsAsFactors = FALSE)
flightsDB <- read.csv('http://s3.amazonaws.com/jsw.dsprojects/AirlinePredictions/CombinedFlights.csv',
                        header = TRUE, stringsAsFactors = FALSE)

Let’s take a look at what our flightsDB dataframe contains to make sure there weren’t any issues.


1   1 2014     1            1           3             AA             12478
2   2 2014     1            2           4             AA             12478
3   3 2014     1            3           5             AA             12478
4   4 2014     1            4           6             AA             12478
5   5 2014     1            5           7             AA             12478
6   6 2014     1            6           1             AA             12478
1           12892          900         1225        13              385     2475
2           12892          900         1225         1              385     2475
3           12892          900         1225        NA              385     2475
4           12892          900         1225        59              385     2475
5           12892          900         1225       110              385     2475
6           12892          900         1225        -8              385     2475
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA

Looks like we can get rid of the first and last columns. We also don’t need the YEAR column either.

flightsDB <- subset(flightsDB, select = -c(X, YEAR, X.1))

It also appeared as if some of the ARR_DELAY values are NA. We can’t predict on these, and we have plenty of training examples anyway, so let’s drop the rows where our target variable is missing. How many are missing?

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-112.00  -11.00   -3.00    7.48   11.00 1983.00  150482 

Well 150,000 missing rows does seem like a lot. How many flights are we dealing with?


[1] 5900609      11

We have 5.9 MILLION rows! Compared to that magnitude, 150,000 missing rows is a small percentage. There isn’t much we can do with these missing values, so we need to drop them. Let’s clean them out.

flightsDB <- na.omit(flightsDB)

Checking again just to make sure no other features have NA values . . .

Min.   : 1.000   Min.   : 1.00   Min.   :1.000   Length:5750127    
1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000   Class :character  
Median : 7.000   Median :16.00   Median :4.000   Mode  :character  
Mean   : 6.624   Mean   :15.74   Mean   :3.938                     
3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:6.000                     
Max.   :12.000   Max.   :31.00   Max.   :7.000                     
Min.   :10135     Min.   :10135   Min.   :   1   Min.   :   1  
1st Qu.:11292     1st Qu.:11292   1st Qu.: 925   1st Qu.:1119  
Median :12889     Median :12889   Median :1320   Median :1520  
Mean   :12671     Mean   :12672   Mean   :1325   Mean   :1499  
3rd Qu.:13931     3rd Qu.:13931   3rd Qu.:1720   3rd Qu.:1910  
Max.   :16218     Max.   :16218   Max.   :2359   Max.   :2359  
Min.   :-112.000   Min.   : 15.0    Min.   :  31  
1st Qu.: -11.000   1st Qu.: 84.0    1st Qu.: 361  
Median :  -3.000   Median :116.0    Median : 626  
Mean   :   7.484   Mean   :136.7    Mean   : 797  
3rd Qu.:  11.000   3rd Qu.:167.0    3rd Qu.:1024  
Max.   :1983.000   Max.   :675.0    Max.   :4983  

Looks like we are okay for missing values. Now that we have loaded our data and cleaned it a little bit, let’s do some feature engineering.

Creating a New Feature: Number of Days to Holiday

This feature was actually included in the Hortonworks example I linked to previously, and I thought it was a good idea. It makes intuitive sense that airlines are probably going to be under stress near holidays, so this feature could probably be a decent predictor of a late flight.

First, we need all of the exact dates for holidays inside our range of flights (November 2013-October 2014). I had to look several of them up because they change every year (like Thanksgiving for example). Let’s create a list of holiday dates and convert them to the DATE type in R.

holidays <- c('2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
            '2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
            '2013-11-28', '2013-12-25') # Ten major holidays, including Memorial Day, Columbus Day, Labor Day, MLK Day
                                            # Veteran's Day, New Year's, President's Day, Independence Day, Thanksgiving, 
                                            # and Christmas Day.
holidayDates <- as.Date(holidays)

To make things easier, let’s create a function to calculate the difference between each of our flight dates and the nearest holiday. We got rid of the YEAR feature earlier as we don’t need it for prediction or analysis, but we will need to include a provision for it in our function to find the exact number of days between any given date and a holiday.

DaysToHoliday <- function(month, day){ # Input a month and day from the flightsDB
    # Get our year.
    year <- 2014
    if (month > 10){
    year <- 2013
    # Paste on a 2013 for November and December dates.
    currDate <- as.Date(paste(year,month,day,sep = '-')) # Create a DATE object we can use to calculate the time difference
    numDays <- as.numeric(min(abs(currDate-holidayDates))) # Now find the minimum difference between the date and our holidays
    return(numDays)                                        # We can vectorize this to automatically find the minimum closest
                                                             # holiday by subtracting all holidays at once

We COULD run this function on every one of our 5.9 million rows, but that would probably be quite slow. If were working on a cluster like in the Hortonworks example, you could probably get away with using Pig like they did to do it.

What would be a much faster and more creative way to solve this problem (since we are not using HDFS or a cluster here) is to create a lookup table that we can access instead. There are only 365 different days in the year, so we actually only need to run the function 365 times instead of 5.9 million. Then, save the results in a table we use later. Much better idea! This trick is often used in programming, and it is called hashing to speed up a program if a lot of the computation is redundant.

So, let’s create our own hash table to speed things up by quite a bit.


datesOfYear <- unique(flightsDB[,1:2]) # Get all of the dates through unique Month/Day combinations
datesOfYear$HDAYS <- mapply(DaysToHoliday, datesOfYear$MONTH, datesOfYear$DAY_OF_MONTH) 
# Apply our function in a vectorized manner via one of R's many "apply" functions (in this case mapply)
# to each unique date and save                                        

Let’s take a look at our hash table to see that it worked.

1     1            1     0
2     1            2     1
4     1            4     3
5     1            5     4
6     1            6     5
7     1            7     6

Similar to before, we want to create a small function that we can then vectorize using mapply. Let’s create a function to read from our hash table.

InputDays <- function(month,day){
    finalDays <- datesOfYear$HDAYS[datesOfYear$MONTH == month & datesOfYear$DAY_OF_MONTH == day] # Find which row to get

Now let’s run this much simpler lookup on all of our rows and generate the feature. If you are using this interactively, it will probably take 5-10 minutes, depending on your computer performance.

flightsDB$HDAYS <- mapply(InputDays, flightsDB$MONTH, flightsDB$DAY_OF_MONTH)

Let’s check our new feature quickly.

1     1            1           3             AA             12478
2     1            2           4             AA             12478
4     1            4           6             AA             12478
5     1            5           7             AA             12478
6     1            6           1             AA             12478
7     1            7           2             AA             12478
Jesse Steinweg-Woods

Jesse Steinweg-Woods

I am currently a Senior Data Scientist at tronc (Tribune Publishing online content), working on recommender systems for articles and better understanding our readers at great papers such as the Los Angeles Times and Chicago Tribune. I prefer open-source tools (especially Python).