As a data scientist, I’m data hungry. I’m always looking for new sources of data. A few years ago, I kept noticing new open data repositories coming online. For instance, I was excited to learn about the opening in 2015 of the Los Angeles Open Data website in my hometown. The public data is organized in a number of high-level categories: the economy, public safety, environment, city services, city budget, events and culture, parks and libraries, and transportation. At the state level, California also has an open data website. And of course, at the national level, there is data.gov, the U.S. government data repository with more than 250,000 data sets. There is even an open data search engine called Apertio.
Here’s a nice example of public data hacking: mapping public debt data taken straight from Wikipedia with the ggplot2 R package.
In this article, I will outline my top 10 reasons why public data hacking is a good idea.
Let’s take a look at the list in no particular order:
- Using public data for doing good – With public data sources, you can focus on a meaningful goal for doing good. This is the main premise of the DataKind organization, whose mission statement is “Harnessing the power of data science in the service of humanity” and which uses public data for many of its projects. Take some time to review some public data resources and grab a few data sets that align with your personal passions, e.g. climate change, endangered species, traffic, crime, etc. Working on a data science project that’s meaningful to you makes it all the more intriguing and fun.
- Public data is remarkably clean – The public data sets I’ve used were very clean in terms of consistency, missing values, and values that make sense. This may be because the sponsors of public data repositories bear a responsibility to ensure the data is usable and ready for public consumption. Of course, clean data streamlines the job of a data scientist, since the data transformation (wrangling, munging) phase of the data science process is simpler.
- Ample data volume – Many of the data sets from open data repositories are extensive, going back many years, and provide some rather high dimensionality. If you’re looking to do some big data experimentation, public data may be a great place to start.
- Varied data formats – The public data sites offer a variety of data formats including CSV, XML, JSON, etc. You’ll likely find a format you’re comfortable with.
- Data flexibility – Most open data repositories have very flexible means for selecting just the data you’re interested in. You can select subsets of variables, ranges of values (e.g. dollar ranges, date ranges, etc.), and categorical variable values (e.g. state="MA").
- Data diversity – You’ll find that each public data website offers a wide variety of data assets for many diverse problem domains. Sometimes I get new project ideas just browsing through collections of open data sets.
- New data scientists – If you’re a newbie data scientist trying to build up your resume, a great place to start is sharpening your skills with public data. You can choose an application close to your heart, like traffic patterns, air pollution, or crime stats. This way you’ll have extra motivation to ferret out new patterns and predictions to help the cause.
- Generate buzz for personal promotion – If your data science project using public data produces some surprising and/or useful results, you can write a paper summarizing your conclusions and submit it to a few appropriate organizations for promotion. If you’re lucky, you might get some good local press about your project.
- Capstone project material – If you’re in an academic program, including a MOOC specialization, public data sources can be a great basis for a capstone project.
- Obtain domain knowledge – Examining data is a great way to learn in general and also obtain specialized knowledge about a new domain space you may have an interest in. For example, maybe you’re interested in the healthcare industry, so examining public data sets describing the healthcare realm will serve to give you important insights.
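To make the data flexibility point above a bit more concrete, here is a minimal Python sketch of that kind of selection applied to a downloaded CSV file: a subset of variables, a date range, a dollar range, and a categorical value such as state="MA". The sample rows, the column names (`state`, `award_date`, `amount`, `agency`), and the `select_rows` helper are all made up for illustration; a real repository will have its own schema and usually its own filtering tools.

```python
import csv
import io
from datetime import date

# Hypothetical sample standing in for a CSV export from an open data portal.
SAMPLE_CSV = """state,award_date,amount,agency
MA,2014-03-01,12000,Parks
MA,2015-07-15,48000,Transit
CA,2015-02-10,30000,Transit
MA,2016-01-20,5000,Libraries
"""


def select_rows(csv_text, columns, state, start, end, min_amount, max_amount):
    """Return dicts holding only `columns` for rows that pass all three filters."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        awarded = date.fromisoformat(row["award_date"])
        amount = float(row["amount"])
        if (row["state"] == state                      # categorical value
                and start <= awarded <= end            # date range
                and min_amount <= amount <= max_amount):  # dollar range
            # Keep only the requested subset of variables.
            selected.append({name: row[name] for name in columns})
    return selected


rows = select_rows(
    SAMPLE_CSV,
    columns=["award_date", "amount"],
    state="MA",
    start=date(2015, 1, 1),
    end=date(2016, 12, 31),
    min_amount=1000,
    max_amount=50000,
)
print(rows)
```

With the made-up sample above, only the two Massachusetts rows dated 2015 or later remain. The same idea carries over to JSON or XML exports by swapping the `csv` reader for the matching parser.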
I consider public data hacking to be an important tool in the data scientist’s problem-solving arsenal. Even if your project is primarily based on proprietary enterprise data assets, you can always supplement this data with public data to gain additional perspectives.