
Scraping OpenStreetMap and exploring POI in Cloudant and Jupyter Notebooks

When working with data, the format of the raw data is not always user-friendly. For instance, the format could be one large binary file, or the data could be spread across hundreds of text files. An easy way to solve this problem is to convert the data and store it in a database.

As an example of how to make working with data simpler, Raj Singh and I converted all the Points of Interest data from the global OpenStreetMap (OSM) project to GeoJSON files, which we then stored, and periodically update, in IBM Cloudant, a database service based on Apache CouchDB™. The data is now easily accessible through an API, which you can try for free.

Our Points of Interest API, based on OpenStreetMap POI data.


OpenStreetMap is built by a community of mappers who contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world.

Read on to learn how we built it and how you can use the data. (Note: if you reproduce all the work described below, you will likely incur costs for Cloudant.)


OpenStreetMap Data

The first step is to download the most recent data for each continent. We used Geofabrik, which extracts, selects, and processes free geo data from OpenStreetMap. The examples that follow use data from Europe, but for complete global coverage, all steps need to be repeated for each continent.


Converting the Data

The second step is to extract the Points of Interest (POI) from this large file. For this we used Osmosis, a command-line Java application for processing OSM data, which you can easily install on a Mac with brew. We used it to extract all the POI data based on a selection of features.

osmosis --read-pbf europe-latest.osm.pbf  --tf accept-nodes  aerialway=station  aeroway=aerodrome,helipad,heliport  amenity=* building=school,university craft=* emergency=*  highway=bus_stop,rest_area,services  historic=* leisure=* office=*  public_transport=stop_position,stop_area railway=station  shop=* tourism=*  --tf reject-ways --tf reject-relations  --write-xml Europe.nodes.osm

The file Europe.nodes.osm contains all POI in Europe, but also some data that we do not need. A handy tool to scrub OSM data is osmconvert. With this tool, selected data can be dropped from the file.

osmconvert Europe.nodes.osm --drop-ways --drop-author --drop-relations --drop-versions Europe.poi.osm

The third step is to convert the POI data to the GeoJSON format. A good tool for this job is ogr2ogr, part of the GDAL library (install it with brew install gdal). Note that we are only interested in points, so only POI data is added to the GeoJSON file Europe.poi.json.

ogr2ogr -f GeoJSON Europe.poi.json Europe.poi.osm points


Uploading Data to Cloudant

Each of the POI objects from the large GeoJSON file needs to be stored in a separate document in the database. To upload them to Cloudant we used couchimport, which does exactly that (and more).

IBM Cloudant is a NoSQL database that you can try out for free after signing up for a Bluemix account. Cloudant has a perpetually free tier, but please check Cloudant pricing if you anticipate heavier long-term use. For example, the POI data for the whole world took up 5.26 GB!

export COUCH_TRANSFORM=./osm_poi_transform.js
export COUCH_URL='https://username:password@opendata.cloudant.com'
cat Europe.poi.json | couchimport --db poi-db --type json --jsonpath 'features.*'

These commands upload all POI features to a database called poi-db. The file osm_poi_transform.js contains extra information to use the osm_id as the document id and to format the keywords.

We keep the data up to date with a weekly Python script that downloads the OSM change file and uses the tools above to create a GeoJSON file with all new or updated POI.

As the change file contains both new and updated POI, we use the Cloudant Python library instead of couchimport. With this library, an existing POI record can be replaced, or a new one added, via the following code from our POI API service:
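A minimal sketch of that upsert logic, assuming the python-cloudant library and GeoJSON features whose properties carry the OSM id (the helper names and database name below are placeholders, not the service's actual code):

```python
# Connecting, in the real script, would look roughly like:
#   from cloudant.client import Cloudant
#   client = Cloudant("username", "password",
#                     url="https://username.cloudant.com", connect=True)
#   db = client["poi-db"]

def as_poi_doc(feature, existing=None):
    """Build the document to store; keep the _rev of an existing doc so the
    write replaces it instead of creating a conflict."""
    doc = dict(feature)
    doc["_id"] = str(feature["properties"]["osm_id"])
    if existing is not None:
        doc["_rev"] = existing["_rev"]
    return doc

def upsert(db, feature):
    """Replace the POI if it already exists in db, otherwise create it."""
    doc_id = str(feature["properties"]["osm_id"])
    if doc_id in db:                      # updated POI: replace it
        doc = db[doc_id]                  # fetches the current revision
        merged = as_poi_doc(feature, existing=doc)
        doc.clear()
        doc.update(merged)
        doc.save()
    else:                                 # new POI: create it
        db.create_document(as_poi_doc(feature))
```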



Easy Access to the Data

Now that the database is ready, it is time to look at the data inside it. You can visualize GeoJSON inside the Cloudant dashboard, or by using the Cloudant APIs. To be able to use Cloudant’s geospatial functionalities, a design document with a geospatial index function needs to be added as in the screenshot below.


Adding a geospatial index in Cloudant.
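The design document behind such an index is small; a sketch using Cloudant Geo's st_index function (the design document and index names here are placeholders):

```json
{
  "_id": "_design/geodd",
  "st_indexes": {
    "points": {
      "index": "function(doc) { if (doc.geometry && doc.geometry.coordinates) { st_index(doc.geometry); } }"
    }
  }
}
```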



After the index has been built (processing can take a while for a large database), you can explore the data in the dashboard by, for instance, drawing a box on a map as below. Interacting with the map in the dashboard will also give you the corresponding API call for this query. It’s a convenient feature for further extending your query with some hints from the getting started example and Cloudant Geospatial documentation.


Selecting all data points within a rectangle.
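A box query like this boils down to a single HTTP GET against the geo index; a small helper that builds the same kind of URL (the account, design document, and index names are placeholders matching the sketch above):

```python
BASE = "https://opendata.cloudant.com/poi-db"  # placeholder account and database

def bbox_url(west, south, east, north, limit=200):
    """Build a Cloudant Geo bounding-box query URL, like the dashboard shows."""
    return (f"{BASE}/_design/geodd/_geo/points"
            f"?bbox={west},{south},{east},{north}"
            f"&limit={limit}&format=geojson")

url = bbox_url(4.7, 52.3, 5.1, 52.5)  # roughly the Amsterdam area
```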

Analyse the Data in a Python Notebook

Another way to access and analyse the data is in a Python notebook. The examples below are designed for you to be able to easily copy & paste them into a Jupyter Notebook. You can run your notebook locally or in the cloud. We ran ours in the cloud using the IBM Data Science Experience (DSX) platform, which you can try out for free.

With the pandas and PixieDust packages, you can use the URL from the Cloudant dashboard above to start exploring. The code below loads a JSON file with the 200 POIs from the above map into a pandas DataFrame. To load the properties of the POI data, add &include_docs=true to the URL.
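A sketch of that loading step. The sample response below is a tiny stand-in; in the notebook you would pass the dashboard URL (with &include_docs=true appended) straight to read_json:

```python
import io

import pandas as pd

# In the notebook, with a real query URL from the dashboard:
#   poi_df = pd.read_json(url)
sample = """{
  "bookmark": "g1AAAA",
  "rows": [
    {"id": "26862135",
     "doc": {"_id": "26862135",
             "geometry": {"type": "Point", "coordinates": [4.89, 52.37]},
             "properties": {"name": "Central Cafe", "amenity": "cafe"}}}
  ]
}"""
poi_df = pd.read_json(io.StringIO(sample))
```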

Using PixieDust to display(poi_df) in a separate notebook cell will render the POI data from Cloudant as a table.

The DataFrame needs some cleaning up, as all the variables are combined into one column: rows. You can extract the fields you are interested in with a lambda function, which applies the function try_field to each row: it checks whether a field exists and, if it does, writes the value to a new column. This code example only checks a few fields, but there are many more, as you can see in the features selected with osmosis above. After adding the new columns, the original columns bookmark and rows can be dropped.
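A sketch of that cleanup, assuming each entry in the rows column holds the raw Cloudant row with its doc (the one-row DataFrame below stands in for the one loaded from Cloudant):

```python
import pandas as pd

# Tiny stand-in for the DataFrame loaded from Cloudant.
poi_df = pd.DataFrame({
    "bookmark": ["g1AAAA"],
    "rows": [{"id": "26862135",
              "doc": {"geometry": {"type": "Point", "coordinates": [4.89, 52.37]},
                      "properties": {"name": "Central Cafe", "amenity": "cafe"}}}],
})

def try_field(row, field):
    """Return one property of a POI row if it exists, else None."""
    try:
        return row["doc"]["properties"][field]
    except (KeyError, TypeError):
        return None

# Pull a few of the many possible OSM fields out into their own columns.
for field in ["name", "amenity", "shop", "public_transport"]:
    poi_df[field] = poi_df["rows"].apply(lambda r: try_field(r, field))

# Latitude and longitude come from the GeoJSON geometry.
poi_df["longitude"] = poi_df["rows"].apply(lambda r: r["doc"]["geometry"]["coordinates"][0])
poi_df["latitude"] = poi_df["rows"].apply(lambda r: r["doc"]["geometry"]["coordinates"][1])

poi_df = poi_df.drop(columns=["bookmark", "rows"])
```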

You can re-run your display(poi_df) cell to see how the POI data has changed.


Create a Map with PixieDust

PixieDust is a great Python package to quickly visualize your data in Jupyter Notebooks. The formatted data above can be plotted on a map with the following code. First, you’ll need to add an extra column to specify which points are a shop, public transport, or an amenity. Then you can make a map by simply using the display() command and selecting a map from the menu.
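The extra column can be derived from the fields extracted during cleanup; a sketch (the column names match the cleanup step above, and the small DataFrame is a stand-in):

```python
import pandas as pd

# Tiny stand-in for the cleaned-up DataFrame.
poi_df = pd.DataFrame({
    "name": ["Central Cafe", "Central Station", "Corner Shop"],
    "amenity": ["cafe", None, None],
    "public_transport": [None, "stop_position", None],
    "shop": [None, None, "convenience"],
    "latitude": [52.37, 52.38, 52.36],
    "longitude": [4.89, 4.90, 4.88],
})

def category(row):
    """Label each POI as a shop, a public transport stop, or an amenity."""
    if row["shop"]:
        return "shop"
    if row["public_transport"]:
        return "public_transport"
    if row["amenity"]:
        return "amenity"
    return "other"

poi_df["category"] = poi_df.apply(category, axis=1)

# In the notebook, render the map with PixieDust:
#   import pixiedust
#   display(poi_df)
```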

PixieDust has two map renderers. To visualize your POI data, you'll need to choose mapbox. (Currently, the Google Maps renderer in PixieDust only uses simple location data, like country codes, not latitude & longitude.) As such, you'll need a Mapbox access token, which you can get for free by signing up for an account. Enter it in your visualization's Options dialog, like so:

Entering map visualization options via PixieDust, in a Jupyter Notebook on IBM’s DSX platform.

You’ll want to specify latitude and longitude as your keys, with a numeric value like shops or amenities as your value.

Rendering map data using Mapbox in PixieDust. Looks like the map previewed in the Cloudant dashboard, only more stylish and with more options!


Use the Points of Interest API

You can also try this analysis using our POI API. Connecting to it is a little simpler and cleaner than loading the data directly from Cloudant, and you can grab more data in one call.


Try it out by replacing the corresponding notebook cells with the following snippets:
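One way such a call might look, assuming a GeoJSON-returning endpoint (the URL and parameters below are placeholders, not the API's documented interface):

```python
import pandas as pd

# In the notebook, one request can replace the Cloudant loading and cleanup cells:
#   import requests
#   resp = requests.get("https://<poi-api-host>/points",       # placeholder URL
#                       params={"bbox": "4.7,52.3,5.1,52.5", "limit": 1000})
#   features = resp.json()["features"]
features = [  # canned stand-in for the API's GeoJSON features
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [4.89, 52.37]},
     "properties": {"name": "Central Cafe", "amenity": "cafe"}},
]

# json_normalize flattens the nested GeoJSON properties into columns.
poi_df = pd.json_normalize(features)
poi_df["longitude"] = poi_df["geometry.coordinates"].str[0]
poi_df["latitude"] = poi_df["geometry.coordinates"].str[1]
```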




Some Final Thoughts

As you might have noticed, there is no password needed to access this data set. As the OSM data is open data, we are keeping this POI database open as well. Feel free to have a play with the data. We would love to hear what you are building!


Originally posted at medium.com/

