Editor’s Note: See Phil present his talk “Python for Data Acquisition” at ODSC West 2019.
What does it take, on the technical side, to get a project started? After you have an idea and find something you want to study, you need to get some data. Where do you get data? Primary sources? Web sites? Databases? There are many different sources and possibilities. Which ones should you choose? How can you trust that the data will remain available so your work stays reproducible? Will it be easy to update once new data becomes available? These are just the beginnings of the issues involved in acquiring data for your project, but you can use Python for data acquisition to make it easier.
There are many sources of freely available public data. The US Federal Government runs Data.gov for its public data. The topics covered on this site span what the government works on, such as agriculture, climate, education, transportation, and energy. Individual divisions of the federal government, like NASA, may also have their own open data. Most states and cities also run web sites with a lot of data. ODSC West 2019 is in San Francisco, and the city has its own web site of local government data, Data SF.
Other governments and NGOs offer similar open data sites:
- European Union
- Data World
Google (Public Data Directory), Amazon (AWS Open Data), Microsoft, and IBM (Cloud Data Services) all offer open data sets for public use. GitHub hosts curated lists of many more sources, such as Awesome Public Datasets. With a little looking around, you can find a data set for almost anything you want to study!
Even this almost infinite supply of options doesn't mean that the data is ready to go for your application or model. You still need to download the data and parse it into a usable format. The data on these sites is stored in a variety of formats, including GIS, CSV, XML, JSON, plain text, HTML, and various binary types. It is quite possible for your project to need data from multiple sources and in multiple formats. This can create a variety of issues for any project, whether it is just getting started or already under way.
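To make the multiple-format problem concrete, here is a minimal sketch that reads the same kind of record from CSV, JSON, and XML using only the Python standard library and normalizes everything into one list of dictionaries. The sample payloads, field names, and helper function names are made up for illustration; real open-data files will have their own layouts.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up sample payloads standing in for files from three different sources.
CSV_TEXT = "city,population\nAlpha,100\nBeta,250\n"
JSON_TEXT = '[{"city": "Gamma", "population": 75}]'
XML_TEXT = '<cities><city name="Delta" population="40"/></cities>'


def parse_csv(text):
    """Read CSV rows as dicts keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"city": row["city"], "population": int(row["population"])} for row in reader]


def parse_json(text):
    """Decode a JSON array of objects into the same record shape."""
    return [
        {"city": item["city"], "population": int(item["population"])}
        for item in json.loads(text)
    ]


def parse_xml(text):
    """Pull the attributes off each <city> element."""
    root = ET.fromstring(text)
    return [
        {"city": el.get("name"), "population": int(el.get("population"))}
        for el in root.iter("city")
    ]


# Every source ends up in one common structure, whatever its original format.
records = parse_csv(CSV_TEXT) + parse_json(JSON_TEXT) + parse_xml(XML_TEXT)
for record in records:
    print(record["city"], record["population"])
```

Converting every source into one shared record shape early on means the rest of the project does not need to know which format a given record came from.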
Once we have this data, how do we hang on to it? Do we want to download it again for each application or model that is built on it? What happens if the website goes away, changes its policies, or changes its format? Storing all of this data in your own database can ease this issue. Once you have downloaded and cleaned up your data, store it in a local database. From there, all future applications and models only need to access the database, without worrying about all of the other issues of getting the data.
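As a rough sketch of the "store it once, query it many times" idea, the example below uses the standard library's sqlite3 module as a stand-in for whatever local database you choose. The table name, column names, and sample rows are hypothetical.

```python
import sqlite3

# Made-up records standing in for data that was already downloaded and cleaned.
cleaned = [("Alpha", 100), ("Beta", 250), ("Gamma", 75)]


def store_records(conn, rows):
    """Create the table if needed and insert rows, replacing rows with the same city."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS city_population ("
        "city TEXT PRIMARY KEY, population INTEGER NOT NULL)"
    )
    # Parameterized statement; INSERT OR REPLACE keeps reruns from duplicating rows.
    conn.executemany(
        "INSERT OR REPLACE INTO city_population (city, population) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def largest_city(conn):
    """Return the (city, population) row with the highest population."""
    return conn.execute(
        "SELECT city, population FROM city_population "
        "ORDER BY population DESC LIMIT 1"
    ).fetchone()


# ":memory:" keeps the demo self-contained; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
store_records(conn, cleaned)
store_records(conn, cleaned)  # loading the same data again does not add rows
count = conn.execute("SELECT COUNT(*) FROM city_population").fetchone()[0]
print(count, largest_city(conn))
```

Later applications and models only need the query side (here, largest_city) and never have to touch the original download or cleanup code.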
This is where Python comes in! With the right libraries and some coding, it can handle all of these tasks and then some. Using the Requests library, downloading web pages and other files is very simple. With the correct credentials, it can also log into a server for non-public or restricted data. If the files are compressed, Python has archiving libraries for this. For the various formats, Python has standard modules such as csv, json, and re (regular expressions). From here, storing data in a database can be done by wrapping SQL in Python via psycopg2 or by using an ORM such as SQLAlchemy.
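The archive and regular-expression steps can be sketched with the standard library alone. In the example below, a small zip file is built in memory to stand in for a downloaded file, so the code runs without network access; the file name, the text layout, and the helper names are assumptions made for illustration.

```python
import io
import re
import zipfile


def build_sample_archive():
    """Create a small in-memory zip file, standing in for a downloaded archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("readings.txt", "station A1: 12.5 C\nstation B2: 9.0 C\n")
    return buffer.getvalue()


def extract_text(zip_bytes, member):
    """Open the archive from raw bytes and return one member decoded as text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(member).decode("utf-8")


# Matches lines such as "station A1: 12.5 C" (a made-up layout).
READING = re.compile(r"station (\w+): ([\d.]+) C")


def parse_readings(text):
    """Map each station id to its temperature as a float."""
    return {station: float(value) for station, value in READING.findall(text)}


# With Requests installed, the bytes could instead come from a real download:
#   zip_bytes = requests.get(url, timeout=30).content
zip_bytes = build_sample_archive()
readings = parse_readings(extract_text(zip_bytes, "readings.txt"))
print(readings)
```

Working on bytes in memory, rather than on a file path, lets the same functions handle data read from disk or fetched over HTTP.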
This Course on Python for Data Acquisition
The goal of this course, which I’ll be presenting at ODSC West 2019, is to walk students through this process and give them a few labs where they can practice it. The students will learn to download data, parse various data file formats, and interact with a database for storing and retrieving data.