Scraping OpenStreetMap and exploring POI in Cloudant and Jupyter Notebooks
When working with data, the format of the raw data is not always user-friendly. For instance, the format could be one large binary file, or the data could spread across hundreds of text files. An easy way to solve this problem is to convert the data and store it in a database.
As an example of how to make working with data simpler, Raj Singh and I converted all the Points of Interest data from the global OpenStreetMap (OSM) project to GeoJSON files, which we then stored, and periodically update, in IBM Cloudant, a database service based on Apache CouchDB™. The data is now easily accessible through an API, which you can try for free.
OpenStreetMap is built by a community of mappers who contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world.
Read along to learn how we built it and how you can use the data. (Note: Should you reproduce all the work described below, you will likely incur costs for Cloudant.)
The first step is to download the most recent data for each continent. We used Geofabrik, which extracts, selects, and processes free geo data from OpenStreetMap. The examples that follow use data from Europe, but for complete global coverage, all steps need to be repeated for each continent.
Converting the Data
The second step is to extract the Points of Interest (POI) from this large file. We used Osmosis, a command-line Java application for processing OSM data that you can easily install on a Mac with brew. We used it to extract all the POI data based on a selection of features:
osmosis --read-pbf file=europe-latest.osm.pbf --tf accept-nodes aerialway=station aeroway=aerodrome,helipad,heliport amenity=* building=school,university craft=* emergency=* highway=bus_stop,rest_area,services historic=* leisure=* office=* public_transport=stop_position,stop_area railway=station shop=* tourism=* --tf reject-ways --tf reject-relations --write-xml file=Europe.nodes.osm
Europe.nodes.osm contains all POI in Europe, but also some data that we do not need. A handy tool to scrub OSM data is osmconvert. With this tool, selected data can be dropped from the file.
osmconvert Europe.nodes.osm --drop-ways --drop-author --drop-relations --drop-versions Europe.poi.osm
The third step is to convert the POI data to the GeoJSON format. A good tool for this job is ogr2ogr, part of the GDAL library, which you can install with brew install gdal. Note that we are only interested in points, so only POI data is added to the GeoJSON file:
ogr2ogr -f GeoJSON Europe.poi.json Europe.poi.osm points
Uploading Data to Cloudant
Each of the POI objects from the large GeoJSON file needs to be stored in a separate document in the database. To upload them to Cloudant we used couchimport, which does exactly that (and more).
IBM Cloudant is a NoSQL database that you can try out for free after signing up for a Bluemix account. Cloudant has a perpetually free tier, but please check Cloudant pricing if you anticipate heavier long-term use. For example, the POI data for the whole world took up 5.26 GB of storage!
export COUCH_TRANSFORM=./osm_poi_transform.js
export COUCH_URL='https://username:firstname.lastname@example.org'
cat Europe.poi.json | couchimport --db poi-db --type json --jsonpath 'features.*'
These commands upload all POI features to a database called poi-db. The file osm_poi_transform.js contains extra information to use the osm_id as the document id and to format the keywords.
We keep the data up to date by running a weekly Python script that downloads the OSM change file and uses the tools above to create a GeoJSON file with all new or updated POI.
As the change file contains both new and updated POI, we use the Cloudant Python library instead of couchimport. With this library, a POI record can be replaced, or, if it is a new record, added, as in the following code from our POI API service:
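A minimal sketch of that update logic, assuming the Cloudant Python library (`cloudant` on PyPI); `db` is a database handle from an authenticated client session, and the function names here are illustrative rather than the exact production script:

```python
# Sketch of the weekly upsert step, assuming the `cloudant` Python library;
# `db` is a CouchDatabase handle that supports dict-style document lookup.
def make_poi_doc(feature, existing_rev=None):
    """Turn one GeoJSON feature into a Cloudant document keyed by its osm_id.

    Passing the _rev of an existing document turns the write into a
    replacement instead of an insert.
    """
    doc = dict(feature)
    doc["_id"] = str(feature["properties"]["osm_id"])
    if existing_rev is not None:
        doc["_rev"] = existing_rev
    return doc

def upsert(db, feature):
    """Replace an existing POI document, or add it if the osm_id is new."""
    doc_id = str(feature["properties"]["osm_id"])
    rev = db[doc_id]["_rev"] if doc_id in db else None
    db.create_document(make_poi_doc(feature, rev))
```

Keying documents by osm_id means a changed POI overwrites its earlier version instead of creating a duplicate.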
Easy Access to the Data
Now that the database is ready, it is time to look at the data inside it. You can visualize GeoJSON inside the Cloudant dashboard or by using the Cloudant APIs. To use Cloudant's geospatial functionality, you first need to add a design document with a geospatial index function.
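The index function follows the pattern from the Cloudant Geospatial documentation: a design document with an st_indexes entry whose index function calls st_index on each document's geometry. The design document and index names below are placeholders:

```json
{
  "_id": "_design/geodd",
  "st_indexes": {
    "geoidx": {
      "index": "function(doc) { if (doc.geometry && doc.geometry.coordinates) { st_index(doc.geometry); } }"
    }
  }
}
```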
After the index has been built (processing can take a while for a large database), you can explore the data in the dashboard by, for instance, drawing a box on a map. Interacting with the map in the dashboard also gives you the corresponding API call for that query, which is convenient for extending it further with hints from the getting started example and the Cloudant Geospatial documentation.
Analyse the Data in a Python Notebook
Another way to access and analyse the data is in a Python notebook. The examples below are designed for you to be able to easily copy & paste them into a Jupyter Notebook. You can run your notebook locally or in the cloud. We ran ours in the cloud using the IBM Data Science Experience (DSX) platform, which you can try out for free.
With the pandas and PixieDust packages, you can use the URL from the Cloudant dashboard above to start exploring. The code below loads a JSON file with the data of the 200 POIs from the above map into a pandas DataFrame. To load the properties of the POI data, add &include_docs=true to the URL.
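A sketch of that loading step; the URL is a placeholder for the geo query copied from your own dashboard, and the `requests` package is assumed for the HTTP call:

```python
import pandas as pd

# Placeholder for the geo query copied from the Cloudant dashboard,
# with &include_docs=true appended:
# url = ("https://<account>.cloudant.com/poi-db/_design/geodd/_geo/geoidx"
#        "?bbox=...&limit=200&include_docs=true")

def rows_to_frame(response_json):
    """Put the raw `rows` of a Cloudant geo query into a one-column DataFrame."""
    return pd.DataFrame({"rows": response_json["rows"]})

# In the notebook:
# import requests
# df = rows_to_frame(requests.get(url).json())
```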
The DataFrame needs some cleaning up, as all the variables are combined into one column: rows. You can extract the fields you are interested in with a lambda function that applies try_field to each row: it checks whether a field exists and, if it does, writes the value to a new column. This code example only checks a few fields, but there are many more, as you can see in the features selected with osmosis above. After adding the new columns, the original rows column can be dropped.
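A sketch of that clean-up, assuming each value in the rows column holds the raw Cloudant document under "doc" (as returned with &include_docs=true); the handful of fields shown is illustrative:

```python
import pandas as pd

def try_field(row, field):
    """Return one property of a POI, or None if the field does not exist."""
    return row["rows"].get("doc", {}).get("properties", {}).get(field)

def extract_fields(df, fields):
    """Pull selected properties into their own columns, then drop `rows`."""
    for field in fields:
        df[field] = df.apply(lambda row: try_field(row, field), axis=1)
    return df.drop(columns=["rows"])

# df = extract_fields(df, ["name", "shop", "amenity", "public_transport"])
```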
Create a Map with PixieDust
PixieDust is a great Python package to quickly visualize your data in Jupyter Notebooks. The formatted data above can be plotted on a map with the following code. First, you'll need to add an extra column to specify whether each point is a shop, public transport, or an amenity. Then you can make a map by simply using the display() command and selecting a map from the menu.
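A sketch of that categorisation step; the labels follow the OSM tags extracted above, and display() comes from PixieDust, which must be installed in the notebook:

```python
def categorise(row):
    """Label each POI for the map legend based on its OSM tags."""
    if row.get("shop"):
        return "shop"
    if row.get("public_transport"):
        return "public_transport"
    if row.get("amenity"):
        return "amenity"
    return "other"

# In the notebook:
# import pixiedust
# df["category"] = df.apply(categorise, axis=1)
# display(df)   # then pick a map chart from the menu
```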
PixieDust has two map renderers. To visualize your POI data, you'll need to choose Mapbox. (Currently, the Google Maps renderer in PixieDust only uses simple location data, like country codes, not latitude and longitude.) As such, you'll need a Mapbox access token, which you can get for free by signing up for an account, and enter it in your visualization's Options dialog.
You’ll want to specify latitude and longitude as your keys, with a numeric value like shops or amenities as your value.
Use the Points of Interest API
You can also try this analysis using our POI API. Connecting to it is a little simpler and cleaner than loading the data directly from Cloudant, and you can grab more data in one call.
Try it out by replacing the corresponding notebook cells with the following snippets:
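A hypothetical sketch of such a cell; the endpoint URL and its parameters are placeholders, so check the POI API service's documentation for the real ones:

```python
import pandas as pd

# POI_API_URL is a placeholder, not the service's real endpoint.
POI_API_URL = "https://example.org/api/pois"

def features_to_frame(geojson):
    """Flatten the features of a GeoJSON response into a DataFrame."""
    records = []
    for feature in geojson.get("features", []):
        record = dict(feature.get("properties", {}))
        record["lon"], record["lat"] = feature["geometry"]["coordinates"]
        records.append(record)
    return pd.DataFrame(records)

# In the notebook:
# import requests
# df = features_to_frame(requests.get(POI_API_URL).json())
```

Because the API returns plain GeoJSON, the DataFrame arrives with one column per property, so the rows clean-up step from earlier is no longer needed.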
Some Final Thoughts
As you might have noticed, there is no password needed to access this data set. As the OSM data is open data, we are keeping this POI database open as well. Feel free to have a play with the data. We would love to hear what you are building!
Originally posted at medium.com/