Google Dataset Search Launched to Help Analysts Scour Repositories
Posted by Diego Arenas, ODSC, October 18, 2018
Google Dataset Search is a new product in the beta phase that you can use to find datasets published online. The single interface allows you to search repositories worldwide.
Imagine you start a new analytics project. For example, let’s say you want to explore numbers pertaining to Boston Public Schools. Previously, you would have typed that query into a general search engine and sifted through web pages; now you can search for datasets specifically and get direct links to them.
The results are listed on the left side of the screen. When you click on one, you see detailed metadata for that dataset: location, authors, license type, timeframe, available formats, a description, funders, etc.
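The article does not describe how this metadata is stored. As an illustration only, here is a minimal sketch of how a publisher might describe a dataset with the schema.org Dataset vocabulary, which is the kind of structured markup a dataset search engine can read. The function name `build_dataset_record` and every value passed to it (dataset name, organizations, coverage) are hypothetical placeholders, not taken from a real listing.

```python
import json


def build_dataset_record(name, description, creators, license_url,
                         temporal_coverage, spatial_coverage, formats, funder):
    """Assemble a schema.org-style Dataset description as a plain dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # Authors of the dataset, modeled here as organizations.
        "creator": [{"@type": "Organization", "name": c} for c in creators],
        "license": license_url,
        # Timeframe and location the data covers.
        "temporalCoverage": temporal_coverage,
        "spatialCoverage": spatial_coverage,
        # One entry per available file format.
        "distribution": [
            {"@type": "DataDownload", "encodingFormat": f} for f in formats
        ],
        "funder": {"@type": "Organization", "name": funder},
    }


# All values below are made up for illustration.
record = build_dataset_record(
    name="Example school enrollment figures",
    description="Yearly enrollment counts per school (placeholder data).",
    creators=["Example Education Office"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
    temporal_coverage="2010/2017",
    spatial_coverage="Boston, MA",
    formats=["text/csv", "application/json"],
    funder="Example Foundation",
)

print(json.dumps(record, indent=2))
```

Each key corresponds to one of the fields shown in the results panel: `creator` for authors, `license` for license type, `temporalCoverage` for timeframe, `distribution` for available formats, and `funder` for funders.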
Searching for datasets online is not typically an easy task. You expect a tabular file format, but instead, you’re often directed to websites that simply list the information. A useful tip is to include “data source” in your search. This will pull up tabular datasets among the first results. Or, now, you can use Google’s dataset search tool.
The issue of finding datasets
We live in a time when we supposedly have more data than ever before. But finding the right datasets remains a significant problem.
For descriptive models, such as clustering or regression, we can take an unfamiliar dataset and start with an exploratory analysis. When we build predictive models, we expect the test data to arrive in more or less the same format as the training data. But it can be challenging to find new data in exactly that format.
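A quick way to see whether a newly found dataset matches the format a model expects is to compare column headers before doing anything else. The sketch below uses only the Python standard library; the helper names `read_header` and `compare_columns`, and the two tiny CSV samples, are hypothetical and exist only to illustrate the check.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)


def compare_columns(expected, found):
    """Report columns that are missing from, or unexpected in, the found header."""
    expected_set, found_set = set(expected), set(found)
    return {
        "missing": sorted(expected_set - found_set),
        "unexpected": sorted(found_set - expected_set),
        "same_order": list(expected) == list(found),
    }


# Made-up samples: the data the model was built on, and a newly found file.
training_csv = "school,year,enrollment\nA,2017,500\n"
new_csv = "school,year,students\nB,2018,450\n"

report = compare_columns(read_header(training_csv), read_header(new_csv))
print(report)
```

Here the new file would be flagged because `enrollment` is missing and `students` is unexpected. That could be a simple rename or a genuinely different measure, which is the kind of question worth settling before feeding the data to a model.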
Furthermore, while open data is great for transparency and accountability, published datasets are often summarized. It makes sense not to publish all the detailed data, but summaries make exploratory data analysis difficult because you cannot break the results down by the underlying variables.
We have more data than ever, and the number of data portals and repositories, some of them selling data, can be overwhelming. While Google Dataset Search still has plenty of room for improvement, it is good to have a single interface.
A partial alternative to Dataset Search is Google Public Data Explorer, a platform where you can navigate and visualize global and local indicators. You have access to thousands of indicators from many datasets. These indicators allow you to study and learn about the state of the world.
I’d recommend reproducing the Hans Rosling analysis on this platform or exploring the status of regions you’re interested in. You will be surprised by the facts you can find. If you can’t find the data you are looking for in the Public Data Explorer, you are better off using Google Dataset Search instead.
Now you can go and try it for yourself and tell us what you think.