This is a very rough sketch of the city of Buenos Aires:
As the sketch shows, it’s a big blob of homes (VIVIENDAs), with an office-ridden downtown to the East (OFICINAS) and a handful of satellite areas.
The sketch, of course, lies. Here’s a map that’s slightly less of a lie:
Both maps are based on the 2011 land usage survey made available by the Open Data initiative of the Buenos Aires city government, more than 555k records assigning each spot to one of about 85 different use regimes. It’s still a gross approximation — you could spend a lifetime mapping Buenos Aires, rewrite Ulysses for a porteño Leopold Bloom, and still not really know it — but already one so complex that I didn’t add the color key to the map. I doubt anybody will want to track the distribution of points for each of the 85 colors.
Ridiculous as it sounds at first, I’d suggest we are using too much of the second type of graph, and not enough of the first. It’s already a commonplace that data visualizations shouldn’t be too complex, but I suspect we are overestimating what people want from a first look at a data set. Sometimes “big blob of homes with a smaller downtown blob due East” is exactly the level of detail somebody needs, and the actual shape of the blobs is irrelevant.
The first graph, needless to say, was created programmatically from the same data set from which I graphed the second. It’s not a difficult process, and the intermediate steps are useful on their own.
Beginning with the detailed map above, you apply something like a smoothing brush to the data points (or a kernel, if you want to sound more mathematical); essentially, you replace the land use tag associated with each point with the majority use in its immediate area, smoothing away the minor exceptions. As you’d expect, it’s not that there aren’t any businesses in Buenos Aires, it’s just that, plot by plot, there are more homes, and when you smooth everything out, it looks more like a blob of homes. This leads to an already much simplified map:
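The smoothing step can be sketched in a few lines. This is only a minimal illustration, not the code behind the maps: it assumes the survey points have already been rasterized into a grid of land use labels (with `None` for empty cells), and the function name `majority_smooth` and the `radius` parameter are made up for the example.

```python
from collections import Counter


def majority_smooth(grid, radius=1):
    """Replace each cell's label with the majority label of its square neighborhood.

    `grid` is a list of rows; each cell is a land use label or None (no data).
    Assumption: the survey points were rasterized into this grid beforehand.
    """
    rows, cols = len(grid), len(grid[0])
    smoothed = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None:
                continue  # leave empty cells empty
            counts = Counter()
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if grid[rr][cc] is not None:
                        counts[grid[rr][cc]] += 1
            own = grid[r][c]
            # Highest count wins; on a tie, the cell's own label is preferred.
            smoothed[r][c] = max(counts, key=lambda k: (counts[k], k == own))
    return smoothed


city = [
    ["VIVIENDA", "VIVIENDA", "VIVIENDA"],
    ["VIVIENDA", "COMERCIO", "VIVIENDA"],
    ["VIVIENDA", "VIVIENDA", "VIVIENDA"],
]
smoothed_city = majority_smooth(city)
```

In the toy grid, the lone shop in the middle is outvoted by the homes around it, which is exactly the "smoothing away the minor exceptions" effect. A larger `radius` gives a coarser map.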
Now, one interesting thing about most people’s sense of space is that it’s more topological than metrical; that is, we are generally better at knowing what’s next to what than at knowing absolute sizes and positions. Data visualizations should go with the grain of human perceptual and cognitive instincts instead of against them, so one fun next step is to separate the blobs (contiguous blocks of points of the same smoothed-out land use type) from each other, and show explicitly what’s next to what. It looks like this:
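Separating the blobs and recording what’s next to what is, in essence, connected-component labeling followed by a scan for touching components. Here is a rough sketch under the same assumption as before (a grid of labels, `None` for empty cells); the names `find_blobs` and `blob_adjacency` and the choice of 4-neighbor connectivity are illustrative, not taken from the original analysis.

```python
from collections import deque


def find_blobs(grid):
    """Flood-fill the grid into blobs of contiguous cells with the same label.

    Returns (blobs, blob_id): `blobs` is a list of {"label", "size"} dicts and
    `blob_id` is a grid holding the index of the blob each cell belongs to.
    """
    rows, cols = len(grid), len(grid[0])
    blob_id = [[None] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None or blob_id[r][c] is not None:
                continue  # empty cell, or already assigned to a blob
            index, label, size = len(blobs), grid[r][c], 0
            blob_id[r][c] = index
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and blob_id[ny][nx] is None
                            and grid[ny][nx] == label):
                        blob_id[ny][nx] = index
                        queue.append((ny, nx))
            blobs.append({"label": label, "size": size})
    return blobs, blob_id


def blob_adjacency(blob_id):
    """Return the set of (a, b) blob index pairs, a < b, that share a border."""
    rows, cols = len(blob_id), len(blob_id[0])
    edges = set()
    for r in range(rows):
        for c in range(cols):
            a = blob_id[r][c]
            if a is None:
                continue
            # Looking only down and right visits each neighboring pair once.
            for ny, nx in ((r + 1, c), (r, c + 1)):
                if ny < rows and nx < cols:
                    b = blob_id[ny][nx]
                    if b is not None and b != a:
                        edges.add((min(a, b), max(a, b)))
    return edges


toy = [
    ["V", "V", "O"],
    ["V", "V", "O"],
    ["P", "V", "O"],
]
blobs, ids = find_blobs(toy)
edges = blob_adjacency(ids)
```

In the toy grid, the home blob `V` touches both the office blob `O` and the small blob `P`, while `O` and `P` never touch, so the result is a graph with two edges.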
Nodes are scaled non-linearly, and we’ve filtered out the smaller ones, but we’ve already done programmatically something that we usually leave to the human looking at a map. We’ve done a napkin sketch of the city, much as somebody would draw North America as a set of rectangles with the right shared frontiers, but not necessarily much precision in the details. It wouldn’t do for a geographical survey, but if you were an extraterrestrial planning to invade Canada, it would provide a solid first understanding of the strategic relevance of Mexico to your plans.
From that last map to the first one, it’s only a matter of remembering that you don’t really care, at this stage, about the exact shape of each blob, just where they stand in relationship to each other. So you replace the blobs with the appropriate land use label, and keep the edges between them. And presto, you have a napkin map.
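The last step, going from blobs to a labeled napkin graph, might look roughly like the sketch below. The post doesn’t say which non-linear scaling or size cutoff was used, so the square root and the `min_size` threshold here are stand-in assumptions, as are the function name and the input format (a list of `{"label", "size"}` blobs plus a set of index pairs for the borders).

```python
import math


def napkin_sketch(blobs, edges, min_size=2):
    """Reduce blobs and their borders to labeled nodes plus edges.

    `blobs` is a list of {"label", "size"} dicts and `edges` is a set of
    (a, b) index pairs into that list. Blobs smaller than `min_size` are
    dropped, along with any edge that touches them.
    """
    keep = {i for i, blob in enumerate(blobs) if blob["size"] >= min_size}
    nodes = {
        i: {
            "label": blobs[i]["label"],
            # Assumed scaling: radius grows with the square root of the
            # area, so big blobs don't dwarf everything else.
            "radius": math.sqrt(blobs[i]["size"]),
        }
        for i in keep
    }
    links = sorted((a, b) for a, b in edges if a in keep and b in keep)
    return nodes, links


example_blobs = [
    {"label": "VIVIENDA", "size": 100},
    {"label": "OFICINAS", "size": 16},
    {"label": "GARAGE", "size": 1},
]
example_edges = {(0, 1), (0, 2)}
nodes, links = napkin_sketch(example_blobs, example_edges)
```

The shapes are gone at this point: all that remains is a label, a rough size, and a list of neighbors, which can be handed to any graph-drawing tool.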
Yes, on the whole the example is rather pointless. Cities are actually the most over-mapped territories on the planet, at both the formal and informal level. Manhattan is an island, Vatican City is inside Rome, the Thames goes through London… In fact, the London Tube Map has become a cliché example of how to display information about a city in terms of connections instead of physical distance. Not to mention that a simplification process that leaves most of the city as a big blob of homes is certainly ignoring more information than you can afford to, even in a sketch.
Not that we usually do this kind of sketching, at least in our formal work with data. We are almost always cartographers when it comes to new data sets, whether geographical, spatial in a general sense, or just mathematically space-like. We change resolution, simplify colors, resist the temptation of over-using 3D, but keep it a “proper” map. Which is good; the world is complex enough that we should do the best mapping we can.
However, once you automate the process of creating multiple levels of simplification and sketching as above, you’ll probably find yourself at least glancing at the simplest (over)simplifications of your data sets. Probably not for presentations to internal or external clients, but for understanding a complex spatial data set, particularly if it’s high-dimensional. Beginning with an over-simplified summary and then increasing the complexity is in fact what you’re already going to do in your own mind, so why not use the computer to help you out?
Originally posted at blog.rinesi.com