Data visualization is an integral part of the data science process. “Data viz” plays an important role early in the process, during exploratory data analysis (EDA), and again at the end, when data storytelling presents results to enterprise decision makers. In this article, we’ll look at how geospatial plots make geographical data easier to interpret and help surface insights.
The application I’ll describe involves a client project I worked on for the Los Angeles fashion industry. It required an analysis of new business starts over a number of years in a well-defined geographical area in and around the Los Angeles fashion district. I did some public data hacking to come up with a viable data source using the Los Angeles Open Data repository. The idea was to determine how new businesses arose over time and in what areas in the district.
For this project, I opted to use the ggmap R package for geospatial data visualization. ggmap is a useful tool which enables visualization by combining the spatial information of static maps from Google Maps, OpenStreetMap, Stamen Maps, or CloudMade Maps with the layered grammar of graphics implementation of the popular ggplot2 R visualization package.
Be sure to check out a very well written research article appearing in The R Journal, ggmap: Spatial Visualization with ggplot2, by the authors of ggmap, David Kahle and Hadley Wickham. This article does a great job of introducing the package along with motivations for its design.
The basic idea behind ggmap is to take a downloaded map image, plot it as a context layer using ggplot2, and then plot additional content layers of data, statistics, or models on top of the map. In ggmap, this process is broken into two parts:
- Use get_map to download the images and format them for plotting.
- Use ggmap to make the plot.
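The two steps above can be sketched as a small helper. This is only an illustration: map_with_points is a hypothetical name, the points data frame is assumed to carry Longitude and Latitude columns, and get_map needs network access (recent ggmap releases may also require a registered Google API key).

```r
# Hypothetical helper illustrating the two-step ggmap workflow.
# Assumes `points` is a data frame with Longitude and Latitude columns.
map_with_points <- function(points, location = "los angeles", zoom = 12) {
  # Step 1: download the map image and format it for plotting
  base_map <- ggmap::get_map(location, zoom = zoom)

  # Step 2: draw the map as the context layer, then add a content layer
  ggmap::ggmap(base_map) +
    ggplot2::geom_point(ggplot2::aes(x = Longitude, y = Latitude),
                        data = points, colour = "red", alpha = 0.6)
}
```

Calling map_with_points(my_points) would return a ggplot object, so further layers can be added with + in the usual way.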
Plotting with ggmap
In this section, I’ll present some R code for producing a geospatial plot. Bear in mind that the project entailed a fairly detailed wrangling step (beyond the scope of this article) that prepared the data set for use in R and ggmap.
One of the requirements for geospatial data visualizations is to have longitude and latitude values for locations on the map. Fortunately, the data set I obtained from the open government resource included both address and longitude/latitude values. Some datasets only have an address, so you’d need to take the extra step during the data wrangling stage to determine longitude and latitude based on the street address.
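For data sets that lack coordinates, the geocoding step might look something like the sketch below. The function name fill_missing_coords and the Address column are assumptions made for illustration. By default it calls ggmap's geocode function, which needs network access and a registered Google API key, but any function that maps a character vector of addresses to a data frame with lon and lat columns can be passed in instead.

```r
# Sketch: fill in missing coordinates from street addresses.
# `Address` is a hypothetical column name; Longitude/Latitude match the
# names used in the plotting code. `geocoder` defaults to ggmap::geocode.
fill_missing_coords <- function(df, address_col = "Address",
                                geocoder = ggmap::geocode) {
  missing <- is.na(df$Longitude) | is.na(df$Latitude)
  if (any(missing)) {
    # Geocode only the rows that need it, to limit API calls
    found <- geocoder(as.character(df[[address_col]][missing]))
    df$Longitude[missing] <- found$lon
    df$Latitude[missing]  <- found$lat
  }
  df
}
```

Geocoding only the rows with missing values keeps the number of API requests down, which matters because geocoding services are typically rate limited.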
The plot is for several years’ worth of data, but it would be easy to do the same for single years, or different year combinations.
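Restricting the data to particular start years before plotting could be done with a small filter like the one below. It is a sketch under assumptions: subset_by_start_year is a hypothetical helper, and the data frame is assumed to have a LOCATION.START.DATE column holding Date values or ISO-formatted (YYYY-MM-DD) strings.

```r
# Hypothetical helper: keep only rows whose start date falls in `years`.
# Assumes `date_col` holds Date values or "YYYY-MM-DD" strings.
subset_by_start_year <- function(df, years,
                                 date_col = "LOCATION.START.DATE") {
  start_year <- as.integer(format(as.Date(df[[date_col]]), "%Y"))
  df[!is.na(start_year) & start_year %in% years, , drop = FALSE]
}
```

The result can be passed as the data argument of the density layer in place of the full data set, for example subset_by_start_year(LAFashion_DTLA_women, 2012) for a single year or subset_by_start_year(LAFashion_DTLA_women, 2010:2012) for a range.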
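# Women's clothing stores in Los Angeles
library(ggmap)  # also attaches ggplot2

# Note: recent ggmap releases may need a registered Google API key for get_map
LA <- get_map('los angeles', zoom = 12)

# Base map as the context layer, density contours as the content layer
LAMap <- ggmap(LA, extent = 'device', legend = 'topright') +
  stat_density2d(aes(x = Longitude, y = Latitude,
                     fill = ..level.., alpha = ..level..),
                 size = 0.5, data = LAFashion_DTLA_women, geom = 'polygon') +
  scale_alpha(range = c(.4, .75), guide = FALSE) +
  guides(fill = guide_colorbar(barwidth = 1.5, barheight = 10))
LAMap  # print the map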
In the plot below, the stat_density2d layer computes density contours from the longitude and latitude values in the data set. Color shading represents the density of the women’s apparel businesses (identified by NAICS code) in different geographical areas: light blue marks the highest density and dark blue the lowest. The legend shows the shading ranges; the highest density aligns with 120 businesses and the lowest with 30.
Geospatial contour map visualization for women’s apparel businesses in Los Angeles
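The contour levels in a plot like this come from a two-dimensional kernel density estimate; ggplot2's stat_density2d relies on kde2d from the MASS package for it. The sketch below runs that estimate directly on synthetic points (randomly generated around an arbitrary center, not the project data) to show what the underlying grid of density values looks like.

```r
library(MASS)  # ships with standard R installations

set.seed(1)
# Synthetic points around an arbitrary center; NOT real business data
lon <- rnorm(200, mean = -118.25, sd = 0.01)
lat <- rnorm(200, mean = 34.04, sd = 0.01)

# Kernel density estimate on a 50 x 50 grid spanning the data range
dens <- kde2d(lon, lat, n = 50)

# Grid cell with the highest estimated density, i.e. the location
# enclosed by the innermost contour
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
c(lon = dens$x[peak["row"]], lat = dens$y[peak["col"]])
```

Each entry of dens$z is an estimated density at one grid point, and the contour polygons are drawn by joining grid locations that share the same estimated value.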
More Traditional Data Visualizations
A more traditional visualization of the data set used in the geospatial representation above would be a stacked bar chart.
# Horizontal stacked bar plot: NAICS by Start Year
library(ggplot2)
library(lubridate)  # year() is assumed to come from lubridate

# Cross-tabulate start year (Var1) against NAICS description (Var2)
counts <- table(year(LAFashion_DTLA$LOCATION.START.DATE),
                LAFashion_DTLA$PRIMARY.NAICS.DESCRIPTION)
counts_df <- as.data.frame(counts)
ggplot(data = counts_df, aes(x = Var2, y = Freq, fill = factor(Var1))) +
  geom_bar(stat = 'identity') + coord_flip() +
  ylab("Number of Businesses Started") +
  ggtitle("LA Fashion Industry – NAICS by Year")
Stacked bar representation of business starts for NAICS code by Year
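The Var1, Var2, and Freq names in the bar chart code are the defaults R assigns when a two-way table is converted to a data frame. A toy example (made-up rows standing in for the start-year and NAICS-description columns) shows the reshaping:

```r
# Toy data for illustration only; not the project data
toy <- data.frame(
  start_year = c(2010, 2010, 2011, 2011, 2011),
  naics = c("Women's Clothing Stores", "Shoe Stores",
            "Women's Clothing Stores", "Women's Clothing Stores",
            "Shoe Stores")
)

# Two-way table: rows are start years, columns are NAICS descriptions
counts <- table(toy$start_year, toy$naics)

# One row per (year, NAICS) pair, with the count in Freq
counts_df <- as.data.frame(counts)
names(counts_df)  # "Var1" "Var2" "Freq"
```

Here Var1 holds the start year and Var2 the NAICS description, which is why the plot maps Var2 to the axis and Var1 to the fill color.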
I include geospatial data visualizations in my final data science project reports whenever appropriate. This class of visualization offers a unique perspective that is easy to interpret. Still, I also include traditional data visualizations as an adjunct to the geospatial ones.