Mapping Amazon, Wal-Mart & Target Warehouses: Which route would Santa take to visit every warehouse in the US?
Editor’s note: Opinions and views expressed in this post do not necessarily reflect the views of #ODSC
Let’s face it: if you’re American and you celebrate Christmas, chances are you have visited Target or Wal-Mart, or shopped at amazon.com this holiday season. Even if you haven’t shopped there, you are probably getting something from one of these retailers…let’s hope it’s a good old-fashioned data science book. The only other gift worth having is a ticket to the next ODSC!
Combined, these retailers made $187B in Q4 last year. To support that much product moving around the country, there has to be an extensive logistics network. I wanted to explore this unseen backbone of US consumption, so I painstakingly gathered warehouse information from various sites. Not surprisingly, complete and unified data was hard to come by. I suspect each retailer wants to keep its distribution philosophy as secret as possible: secrecy avoids competing for the same labor pool in an area, revealing the cost of real estate, and exposing distance to population centers. Still, with some late nights I persevered and unified almost-complete data, largely collecting it manually and cross-validating it across multiple sources.
In this post we will pretend Santa has to visit each of the warehouses so the Elves’ toys can be disseminated to all the children and good data scientists in the US. Along the way you will learn some basic mapping and routing techniques you can use the next time you have geospatial data.
This work is an example of “GIS,” which stands for geographic information systems. Admittedly, I am no expert in this field, but I find it fascinating to visualize geospatial data. I like to use the basic maps package during EDA for geospatial data. It offers quick and easy, if somewhat unappealing, map information for plotting. Then I load both ggplot2 and ggthemes, which are among my favorite R visualization packages. Next, the TSP package is a simple package for “traveling salesperson problems.” This type of problem is one where you are trying to minimize the distance traveled among multiple points in space. As you can imagine, this is complex and has real-world implications. Delivery companies, or in our case Santa, do not want to waste effort by picking a suboptimal route. In this post, we suppose Santa is like a traveling salesman who has to efficiently travel between Amazon, Target and Wal-Mart warehouses because he only has a day to do it.
Lastly, I use options to change the number of decimals that R will print to 15; this is helpful during EDA when working with specific latitude and longitude values. I also use set.seed so your code will create the same visualizations as mine for the TSP map.
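Put together, the session setup described above might look like the sketch below. The seed value is my own placeholder, not the one from the original post.

```r
# Packages referenced throughout this post
library(maps)      # quick base maps for EDA
library(ggplot2)   # layered graphics for the final map
library(ggthemes)  # extra themes, e.g. theme_gdocs
library(TSP)       # traveling salesperson heuristics

# Show up to 15 decimals so full lat/lon precision prints during EDA
options(digits = 15)

# Make the TSP heuristics reproducible (seed value is arbitrary)
set.seed(1234)
```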
Next, go get the data here and load it using read.csv. When the data frame is parsed, some vector classes come in wrong, so I use as.numeric to change $sq_ft. You may want to change the others as you explore the data more.
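As a runnable sketch, here is that load-and-coerce step with a few inline rows standing in for the downloaded file; the column names and comma-formatted square footage are my assumptions about the data’s shape.

```r
# A small stand-in for the real CSV file
csv_text <- 'retailer,type,location,sq_ft,lat,lon
Amazon,fulfillment,"Plainview, TX","449,000",34.18,-101.70
Wal-Mart,food,"Fort Worth, TX","880,000",32.93,-97.31'

warehouses <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# sq_ft parses as character because of the thousands separators,
# so strip the commas and coerce to numeric
warehouses$sq_ft <- as.numeric(gsub(",", "", warehouses$sq_ft))
```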
Exploring the Warehouse Locations
It’s always a good idea to perform basic EDA on a new data set. I usually start with head to examine a few rows. This code will print the top 10 rows to your console (the default is 6). You will notice the column names and also some NA values. This data was challenging to obtain, so in some cases I entered NA, or NA was returned when I geocoded a location for lat/lon.
Since I collected multiple retailers, you may be interested in the distribution among the companies. Using table will tally the retailers. I also use barplot in the second line to visualize the frequency of warehouses. Surprisingly, Amazon has more warehouses than Wal-Mart, despite the fact that Wal-Mart does about 4x the annual revenue. Clearly the organizations think about distribution efficiency differently.
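A minimal sketch of those two lines, using a toy data frame in place of the real data set:

```r
# Toy rows standing in for the full warehouse data
warehouses <- data.frame(
  retailer = c("Amazon", "Amazon", "Amazon", "Wal-Mart", "Wal-Mart",
               "Sam's Club", "Target"),
  stringsAsFactors = FALSE
)

# Tally warehouses per retailer, then visualize the counts
retailer.counts <- table(warehouses$retailer)
barplot(retailer.counts, main = "Warehouses by Retailer")
```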
Amazon has more warehouses even when Sam’s Club is added to the Wal-Mart total.
A popular base function is summary, which can be applied to an entire data set. However, a more nuanced way to explore data is to summarize the features by another feature. summary can be applied in this manner using tapply along with the vectors warehouses$sq_ft and warehouses$retailer. The tapply function applies summary to each cell of the intersections of the vectors. The result is a list summarizing the square footage of warehouses for each retailer.
tapply(warehouses$sq_ft, warehouses$retailer, summary)
Comparing the median values from the code above shows that Wal-Mart is committed to larger warehouses, resulting in less need for them compared to Amazon. Specifically, Amazon’s median warehouse size is 449,000 square feet compared to Wal-Mart’s 880,000.
To get more feature interactions, the aggregate function can be helpful. This code will provide the mean square footage by retailer and by warehouse type. In the second parameter, pass in a list containing the features for subsetting.
aggregate(warehouses$sq_ft, by=list(warehouses$retailer,warehouses$type), FUN=mean,na.rm = T)
This code lets you compare the size of each retailer’s warehouses according to their function. For example, the table is a portion of the aggregate result comparing average “food” distribution center size for the three retailers.
Again, you can see that Wal-Mart is committed to large warehouse centers by type.
During your EDA you should have observed multiple NA’s. Geospatial functions often fail with NA lat/lon values. A quick way to exclude them is with subset. Since a row with NA latitude will also have NA longitude, this single line of code will remove any rows that meet the logical declaration for either lat or lon. The complete.cases function is my declaration; it returns TRUE if the row contains no NA values.
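That filtering step can be sketched as follows, with toy rows standing in for the real data (one row simulates a failed geocode):

```r
# Toy data with one failed geocode (NA lat/lon)
warehouses <- data.frame(
  retailer = c("Amazon", "Target", "Wal-Mart"),
  lat = c(32.7, NA, 35.2),
  lon = c(-96.8, NA, -97.5),
  stringsAsFactors = FALSE
)

# Keep only rows whose lat/lon pair is complete
complete.latlon <- subset(warehouses, complete.cases(lat, lon))
```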
To get a quick EDA map visual, I like to use the map function. First, plot the line map, specifying “state” to get the lower 48. Next, add points to represent each fulfillment center; the X values are longitude and the Y values are latitude. Lastly, I tell R to plot 4 colors and 4 different shapes, one for each of the 4 retailers: Wal-Mart, Sam’s Club, Target and Amazon.
map('state')
points(complete.latlon$lon, complete.latlon$lat,
       col = c('orange','blue','red','yellow')[factor(complete.latlon$retailer)],
       pch = c(16,17,18,19)[factor(complete.latlon$retailer)])
The result isn’t pretty but during EDA I am less concerned with aesthetics. The point is that there are dense East coast and West coast clusters no matter the retailer. This type of quick map is used to find some insights but isn’t really presentation worthy.
A quick map showing a lot of Eastern warehouses and none in Idaho, Montana, Wyoming, or the Dakotas.
The Traveling Santa Problem
Santa will be busy and will not have much time to get all these gifts to the warehouses around the US. To help him, Mrs. Claus, a Kaggle Grandmaster, knows that data science can help. To get started, the ETSP function accepts an X and Y vector within a matrix. This problem is a Euclidean Traveling Salesperson Problem (ETSP) since Santa can fly straight between points, measured as Euclidean distance. In “reality” Earth is a sphere, so a great-circle approach is what airlines would actually use. Remember, these points represent each warehouse location, and the goal is to minimize the distance he has to travel.
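Constructing that coordinate matrix might look like this sketch, with toy coordinates standing in for the cleaned warehouse data (the ETSP call itself is shown as a comment since it needs the TSP package loaded):

```r
# Toy coordinates standing in for complete.latlon
complete.latlon <- data.frame(lat = c(34.2, 32.9, 29.4),
                              lon = c(-101.7, -97.3, -98.5))

# ETSP expects a numeric matrix of coordinates:
# x = longitude, y = latitude
latlon.m <- as.matrix(complete.latlon[, c("lon", "lat")])

# With library(TSP) loaded, the problem object would then be:
# etsp <- ETSP(latlon.m)
```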
The solve_TSP function applies heuristic solutions to the problem of minimizing the distance traveled. Here I selected two_opt, which essentially looks to minimize how often Santa would backtrack over his route. A good article on the approach is here. Another approach in solve_TSP is “nn,” which stands for nearest neighbor; in that case the function selects the nearest warehouse in turn but allows Santa to cross over his path again.
santa.tour<-solve_TSP(latlon.m, method = 'two_opt')
To examine the result, print santa.tour in your console. This will print the class and the length of Santa’s trip, but the length is in Euclidean distance over lat/lon, so it’s not that helpful. Instead, use plot to see the entire route firsthand. In this example I did not specify a start= value. Technically Santa would be coming from the North Pole, which is lat/lon 90, 0, but I didn’t bother adding it ☺
The basic Santa Tour plot isn’t very informative!
To get the order of warehouses that Santa will visit, call as.integer on the tour object. This extracts the index numbers from santa.tour. In this example, Santa should start at row 81, then visit 80, 82 and so on. You can see the order by calling route.order in your console.
A quick way to see the first 6 stops on Santa’s journey is below. I am indexing complete.latlon by the first 6 values (1:6) of route.order. Instead of returning the entire row, I am specifying the location vector with column 4. The next table shows his first 6 stops of the night.
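That indexing can be sketched with toy stand-ins; here I assume, as in the post, that column 4 holds the address string, and I use a made-up three-stop tour in place of the real solve_TSP result.

```r
# Toy stand-in for the cleaned data; column 4 is the address
complete.latlon <- data.frame(lat = c(1, 2, 3), lon = c(4, 5, 6),
                              retailer = c("A", "B", "C"),
                              location = c("Addr 1", "Addr 2", "Addr 3"),
                              stringsAsFactors = FALSE)

# Made-up tour order; in the post this is as.integer(santa.tour)
route.order <- c(3L, 1L, 2L)

# First stops of the night: index rows by tour order, keep column 4
first.stops <- complete.latlon[route.order[1:2], 4]
```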
| Stop | Location |
|------|----------|
| 1 | 3100 North I-27, Plainview, TX 79072 |
| 2 | 700 Westport Parkway (Haslet), Fort Worth, TX 76177-4513 |
| 3 | 15201 Heritage Parkway, Fort Worth, TX 76177-2517 |
| 4 | 15101 N. Beach St., Fort Worth, TX 76177 |
| 5 | 5300 Westport Parkway, Fort Worth, TX 76177 |
| 6 | 4601 Gold Spike Drive, Fort Worth, TX 76106 |
The start of Santa’s Journey to all US Warehouses is in Texas.
An easy way to reorder the original data frame to account for the traveling salesperson solution is with match. To start, add a column, $index, which has values 1 through the number of rows in the data set. This makes the row position a declared vector. Next, using match within square brackets, pass in the route order and $index. Since the values are shared, match will rearrange the rows into the efficient delivery tour.
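The two steps above can be sketched as follows, again with toy data and a made-up tour order in place of the real ones:

```r
# Toy data: three warehouses and a tour that visits row 3 first
complete.latlon <- data.frame(city = c("A", "B", "C"),
                              lat = c(1, 2, 3),
                              stringsAsFactors = FALSE)
route.order <- c(3L, 1L, 2L)

# Make the row position an explicit, declared vector...
complete.latlon$index <- 1:nrow(complete.latlon)

# ...then rearrange the rows into tour order with match
complete.latlon <- complete.latlon[match(route.order, complete.latlon$index), ]
```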
Once the data is reordered you can compare the first 6 warehouse locations with the code below.
Since the basic plot wasn’t informative, I want to create an actual map of Santa’s journey. First, load the US state map to get the spatial information with map_data. This code will invoke the “maps” library to get the plotting data.
state.map <- map_data("state")
Next, create a base layer with ggplot. Then add a polygon layer containing the state-level map data, with x=long and y=lat. I chose grey as the color so the state outlines would not add visual clutter. The next layer is geom_path, which will plot a line in the order of the data frame rows. From the package documentation, “geom_path connects the observations in the order in which they appear in the data.” In contrast, “geom_line connects them in order of the variable on the x axis.” Since this is a linear journey, I have to use geom_path. The next plot layers use xlim and ylim to limit the viewable section; this can be useful if you have outlier lat/lon points. Next, I add a title with ggtitle, apply my favorite plot theme, theme_gdocs, and then remove a lot of the visual elements so you can focus on the plot and Santa’s journey. These last 4 layers (limits, title, theme and blank elements) are not mandatory but make the plot easier to follow.
ggplot() +
  geom_polygon(data = state.map,
               aes(x = long, y = lat, group = group),
               colour = "grey", fill = NA) +
  geom_path(data = complete.latlon,
            aes(x = lon, y = lat), col = "red") +
  xlim(-125, -65) + ylim(25, 50) +
  ggtitle("Santa Deliveries") +
  theme_gdocs() +
  theme(panel.border = element_blank(),
        axis.text = element_blank(),
        line = element_blank(),
        axis.title = element_blank())
A route Santa could take to deliver goods to Target, Wal-Mart and Amazon warehouses.
You can trace Santa’s path starting in Texas and ending in Nevada. Remember, this TSP method sought to minimize crossing paths, but crossings were unavoidable in some cases. Still, the result seems intuitive: Santa starts in Texas, migrates to the lower Midwest, then over to the Southeast. From there he heads up the East coast before turning west across the upper Midwest and then darting down to Southern California. Looping north, he covers the Pacific Northwest before ending in Nevada. Those must be some tired reindeer!
Now that Santa has made his rounds I want to know the nearest warehouse to my location so I can go pick up my goods…assuming I am on the nice list! In the next post I will show you how to geocode an address, find the closest warehouse to that address, get a route from google maps and then get a satellite image of the building that will have my holiday gifts from Santa’s elves. Lastly, I will create an interactive map with all warehouse information from Wal-Mart, Sam’s Club, Target, and Amazon so you can explore the data yourself.