K-Means Clustering Applied to GIS Data
Tools, Clustering, Machine Learning. Posted by Spencer Norris, ODSC, October 11, 2018
Here, we use k-means clustering with GIS data. GIS can be intimidating to data scientists who haven't tried it before, especially when it comes to analytics. On its face, mapmaking seems like a huge undertaking, and the esoteric lingo and strange datafile encodings can create a significant barrier to entry for newcomers.
There's a reason why there are experts who dedicate their careers strictly to GIS and cartography. However, that doesn't mean it's completely inaccessible to the layman. In point of fact, most GIS tools make it very easy to create gorgeous maps. I made this map in five minutes using QGIS and public data from the United States Geological Survey's Wind Turbine Database.
Each point is a wind turbine, encoded as a GeoJSON object. And all I had to do was drag and drop the GeoJSON file into the QGIS GUI.
The extra hurdle for many practitioners is figuring out how to apply analysis techniques to GIS data, but it's surprisingly straightforward. The key insight is that GIS data typically boils down to a projection onto a transformed space, which you can plot in two dimensions. In other words, it's just two axes and two continuous variables, which makes it incredibly easy to adapt our existing machine learning and data mining methods to this space.
If you’ve built a machine learning pipeline in the past, the hardest part will be setting up the infrastructure that you’ve already built a hundred times. Your features are just the coordinates and whatever other columns you want to append to your data.
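To make that concrete, here is a minimal sketch of building a feature matrix from parsed GeoJSON features. The two hand-made records and the `t_cap` (turbine capacity) property are hypothetical stand-ins for what you'd find in the USGS dataset; the point is simply that coordinates and extra columns stack into an ordinary NumPy array.

```python
import numpy as np

# Hypothetical parsed GeoJSON features: (lon, lat) plus a capacity property
features = [
    {"geometry": {"coordinates": [-101.5, 35.2]}, "properties": {"t_cap": 1500}},
    {"geometry": {"coordinates": [-100.9, 35.4]}, "properties": {"t_cap": 2000}},
]

coords = np.array([f["geometry"]["coordinates"] for f in features])
caps = np.array([[f["properties"]["t_cap"]] for f in features])

# The feature matrix is just the coordinates plus whatever columns you append
X = np.hstack([coords, caps])
```

From here, `X` can be fed to any scikit-learn estimator like any other tabular dataset.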
To illustrate this point, I ran K-means clustering against the dataset used to create the map above, then plotted the points. Feel free to lift this code!
import json

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from sklearn.cluster import KMeans

# Load the GeoJSON file -- it parses as ordinary JSON
with open('uswtdb_v1_1_20180710.geojson') as f:
    data = json.load(f)

# Pull the (longitude, latitude) pair out of each feature
coordinates = [feature['geometry']['coordinates'] for feature in data['features']]
coordinates = np.array(coordinates)

# Train model
kmeans = KMeans(n_clusters=5)
kmeans.fit(coordinates)

# Plot clusters
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(coordinates[:, 0], coordinates[:, 1],
            c=kmeans.predict(coordinates), s=50, cmap='viridis')
centers = kmeans.cluster_centers_
# Clip the x-axis so far-flung points don't stretch the plot
plt.xlim(min(coordinates[:, 0]) - 10, -50)
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
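One knob worth a closer look is `n_clusters=5`, which I picked by hand. A common way to sanity-check the choice is the elbow method: fit k-means for a range of k and watch where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below uses synthetic blobs as a stand-in for the turbine coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for turbine coordinates: three loose blobs
coords = np.vstack([rng.normal(c, 1.0, size=(100, 2))
                    for c in [(-120, 40), (-100, 35), (-80, 42)]])

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; look for the "elbow" where the
# gains flatten out -- here that should happen around k = 3.
```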
I admittedly committed a cardinal sin in this code snippet; you’re never supposed to test an algorithm against the data used to train it. However, since I’m just trying to discover clusters on a limited dataset and am not using this for future predictions (and since this is only a small demo), this is fine for our purposes.
Notice that the GeoJSON file can be read as JSON — because that’s all it is. GeoJSON is just a particular expression of standard JSON objects that encodes points and shapes. However, you can use your standard JSON libraries to parse and analyze it, as well as append new attributes to each of the objects.
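The point is easy to demonstrate with the standard library alone. Below, a tiny hand-made FeatureCollection (the turbine record and its `t_cap` property are made up for illustration) is parsed with `json`, and a new attribute is appended to a feature exactly as you would with any other JSON object.

```python
import json

# A minimal hand-made GeoJSON FeatureCollection (hypothetical turbine record)
raw = '''{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-101.5, 35.2]},
     "properties": {"t_cap": 1500}}
  ]
}'''

data = json.loads(raw)
for feature in data["features"]:
    lon, lat = feature["geometry"]["coordinates"]
    # Append a new attribute just like any other JSON object
    feature["properties"]["cluster"] = 0
```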
Be aware of map types
There's one pitfall you should be wary of if you decide to try machine learning on GIS data: the map projection the data was encoded in can be an important factor for accuracy-sensitive applications.
Not all maps are the same. The Mercator projection is the view of the world we're most familiar with, since nearly every user-facing mapping application uses it. Whenever you open Google Maps, you're looking at Gerardus Mercator's view of the world.
However, if data is encoded using a different projection, it will affect the coordinates of individual points and the geometry of shapes. This can have dramatic effects on the outcome of machine learning applications in practice. Ultimately, whatever projection you decide to use will depend on your application — just don’t expect someone with a different map to get the same results.
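To see the distortion concretely, here is the standard spherical Web Mercator formula (the variant behind EPSG:3857, which web maps use), implemented from scratch with the standard library. The same 10-degree step in latitude spans far more projected distance near the poles than near the equator.

```python
import math

R = 6378137.0  # radius of the WGS 84 reference sphere, in meters

def to_web_mercator(lon, lat):
    """Project (lon, lat) in degrees onto Web Mercator (EPSG:3857) meters."""
    x = math.radians(lon) * R
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * R
    return x, y

# Identical 10-degree latitude steps, taken near the equator and far from it
_, y_eq = to_web_mercator(0, 0)
_, y_10 = to_web_mercator(0, 10)
_, y_60 = to_web_mercator(0, 60)
_, y_70 = to_web_mercator(0, 70)
# (y_70 - y_60) comes out more than twice (y_10 - y_eq): Mercator
# stretches high latitudes, so raw coordinates are not distances.
```

This is exactly why a distance-based method like k-means can land on different clusters depending on which projection the coordinates were encoded in.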
Machine learning on geographic data is very simple and can open new possibilities for the enterprising practitioner. Give it a shot and see what you can find out using publicly available datasets from data.gov or other asset collections.
Ready to learn more data science skills and techniques in-person? Register for ODSC West this October 31 – November 3 now and hear from world-renowned names in data science and artificial intelligence!