When Less is More: A Brief Story About Feature Engineering with XGBoost

I played a minor role in launching RAPIDS on Google Dataproc by refining a model that predicts taxi fares in New York City. The geographic locations of passenger pick-ups and drop-offs were columns in the data. These are recorded as longitude and latitude measurements, precise to many decimal places. One of the first things I did to improve the model’s performance was to round the longitude and latitude values to make them less precise. What’s going on here? Isn’t more information supposed to make a model better? To understand what’s happening, we need to think a little bit about what the geolocation data represents, and understand how XGBoost decides to split up data when it is making predictions.

Coarse Behavior

I had recently seen a tweet commenting on how geolocation data is often stored at too precise a level for a lot of purposes.[1] Possibly due to the Baader-Meinhof effect, I was immediately interested in the geolocation data in this problem.

I learned that one degree of latitude is roughly 67 miles (most references put it closer to 69), or about 353,760 feet. The numbers in the data are reported to the fifth decimal place, e.g. -73.99477. That means the last digit is in units of about 3.5 feet. It’s a real miracle of science that we can measure taxi pick-ups from space this closely; a half car length is about as far as I ever want to walk for a cab. However, at this resolution it seems like we’re looking too closely to see any useful patterns.
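The arithmetic behind that 3.5 foot figure is easy to check. Here is a minimal sketch that uses the same rough 67 miles per degree as the text; the function name `resolution_feet` is made up for illustration, and the conversion ignores the fact that a degree of longitude is shorter than a degree of latitude at New York's latitude.

```python
FEET_PER_MILE = 5280
MILES_PER_DEGREE = 67  # rough figure used in the text; an approximation


def resolution_feet(decimals: int) -> float:
    """Distance in feet covered by one unit in the last kept decimal place."""
    feet_per_degree = MILES_PER_DEGREE * FEET_PER_MILE
    return feet_per_degree * 10 ** (-decimals)


# Compare the raw precision (5 places) with the two coarser candidates.
for decimals in (5, 3, 2):
    feet = resolution_feet(decimals)
    print(f"{decimals} decimal places: {feet:,.1f} feet ({feet / FEET_PER_MILE:.3f} miles)")
```

Swapping in 69 for `MILES_PER_DEGREE` moves every result by about 3%, which does not change the conclusion.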

Based on the handful of visits I’ve made to New York, it does seem like traffic can vary from block to block. One blog I found suggests that “a north-south block in Manhattan runs approximately 264 feet,” and that an east-west block is about 750 feet.

If we round our geographic data to three decimal places, our smallest unit becomes roughly 354 feet: a little longer than a short block, and a little less than half as long as a long block. Splitting a block seems unhelpful, so we’ll pull back a little more. Based on model performance, I rounded the longitude and latitude numbers to two decimal places, which makes our smallest unit about ⅔ of a mile. You can also imagine this as a 100 by 100 grid laid over one degree of latitude and longitude around New York City.

This small change alone reduced the RMSE by 4.6%.
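The rounding step itself is only a line or two. Below is a minimal sketch using pandas; the column names and the two sample trips are invented for illustration and are not taken from the actual dataset. On RAPIDS the same idea would apply to a cuDF DataFrame, though I have not shown that here.

```python
import pandas as pd

# Hypothetical column names; the real dataset's schema may differ.
GEO_COLUMNS = [
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
]


def coarsen_geo(df: pd.DataFrame, decimals: int = 2) -> pd.DataFrame:
    """Return a copy of df with the coordinate columns rounded to `decimals` places."""
    out = df.copy()
    out[GEO_COLUMNS] = out[GEO_COLUMNS].round(decimals)
    return out


# Two made-up trips, only to show the effect of rounding.
trips = pd.DataFrame(
    {
        "pickup_longitude": [-73.99477, -73.99391],
        "pickup_latitude": [40.75012, 40.74688],
        "dropoff_longitude": [-73.97812, -73.98155],
        "dropoff_latitude": [40.76204, 40.75931],
        "fare_amount": [7.5, 9.0],
    }
)

coarse = coarsen_geo(trips, decimals=2)
print(coarse)
```

Two pick-ups that started a few hundred feet apart now share the same coordinates, which is exactly the point: the model sees them as coming from the same cell.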

What Really Matters?

What is the real unit that we think matters when it comes to a taxi ride? My two guesses were units of one block (three decimal places) and units of one neighborhood (two decimal places, cells about ⅔ of a mile on a side). Testing revealed that the best results came from two decimal places.

Using the raw values essentially amounts to dividing NYC into a 3.5 foot by 3.5 foot grid. At this resolution, the extra precision is noise relative to our problem. If the data isn’t modified, XGBoost can place its split thresholds at this 3.5 foot tolerance. No reasonable person would believe that taking a couple of long steps in any given direction would change your taxi experience at all.[2] After rounding, we’re instead asking the algorithm, “what is the impact of leaving from a particular ⅔ mile square of the city and arriving in another one?” The default number of bins that XGBoost’s histogram-based tree method creates to discretize continuous variables is 256. By rounding, we’ve essentially reduced the number of distinct values per coordinate, and with it the number of bins, to about 100.
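To get a feel for how much rounding shrinks the space of possible split points, here is a rough sketch with NumPy. The coordinates are simulated uniformly over a one degree span, which is an assumption made purely for illustration; real pick-ups cluster heavily. The 256 figure corresponds to XGBoost's `max_bin` parameter for histogram-based tree methods, and counting distinct values is only a proxy for what the algorithm actually does.

```python
import numpy as np

XGBOOST_DEFAULT_MAX_BIN = 256  # default `max_bin` for histogram-based tree methods

# Simulated longitudes across a one degree span (illustrative, not real data).
rng = np.random.default_rng(0)
longitude = rng.uniform(-74.5, -73.5, size=100_000)


def distinct_values(values: np.ndarray, decimals: int) -> int:
    """Count the distinct values left after rounding to `decimals` places."""
    return int(np.unique(np.round(values, decimals)).size)


n_fine = distinct_values(longitude, 5)
n_coarse = distinct_values(longitude, 2)

print(f"5 decimal places: {n_fine} distinct values")
print(f"2 decimal places: {n_coarse} distinct values")
print(f"fits within default bins after rounding: {n_coarse <= XGBOOST_DEFAULT_MAX_BIN}")
```

At five decimal places there are far more distinct values than bins, so the bin edges fall wherever the quantiles of the data happen to land. At two decimal places every distinct value can have its own bin, and each bin corresponds to a cell we chose on purpose.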

Use Your Knowledge

XGBoost is a powerful algorithm, but you can’t use it to its fullest by simply pointing it at data and carefully tuning hyperparameters. Good data science involves always keeping in mind the question you’re trying to answer. If you have the luxury of interpretable features in your model, you should incorporate your prior knowledge, or your best guess about the underlying causal mechanism you’re predicting, when engineering features. Always ask yourself, “is this data meaningful for the question I’m trying to answer?”

RAPIDS makes it easy to quickly iterate and try out new ideas. Finding the optimal resolution of the geographic data only took a couple of hours.

I hope you’ve enjoyed my feature engineering story, and I hope you check out our recent blog on getting up and running with Google Dataproc. When have you seen big gains from seemingly simple feature engineering? Let me know in the comments below, or find me on Twitter @realpaulmahler.

  1. I could not find this tweet, but I believe I saw it because I follow John Murray (@murraydata).
  2. Unreasonable people, such as myself, can imagine Charles Addams/Edward Gorey situations where a couple of steps, say, out of the center of the road could help, but I assume these are rare cases.

Originally Posted on Medium.com by Paul Mahler