

Different Ways of Getting Datasets for Your Data Science Tasks
Modeling | Posted by Parul Pandey, June 14, 2022

While going through the list of the articles that I have written to date, I discovered that quite a few were related to the concept of acquiring datasets for data science tasks. Some of those articles are targeted at finding good dataset websites, while others look at ways to create custom datasets. This article is a compilation of the various concepts covered in different articles. One can think of it as summarizing the multiple techniques while linking back to the original articles.
1. Advanced Google Search

Image by Author
Google search is by far the most common way to search for a dataset. But did you know that you can customize the search query to get more accurate results, and get them faster? In this article, we look at three ways to optimize our search on the internet.
Link: Advanced Google Search
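As a rough illustration of a customized query, here is a minimal Python sketch that composes a search string from Google's `site:` and `filetype:` operators plus an exact-phrase match. These operators are my own choice of example and are not necessarily the three techniques covered in the linked article; the helper names `build_dataset_query` and `search_url` are hypothetical.

```python
from urllib.parse import urlencode


def build_dataset_query(topic, site=None, filetype=None):
    """Compose a search string from an exact phrase plus optional operators."""
    parts = [f'"{topic}"']  # double quotes request an exact-phrase match
    if site:
        parts.append(f"site:{site}")  # restrict results to a domain or TLD
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict results to a file type
    return " ".join(parts)


def search_url(query):
    """Turn the query string into a URL that can be opened in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_dataset_query("air quality", site="gov", filetype="csv")
print(query)
print(search_url(query))
```

Pasting the printed query into the search box should narrow the results to CSV files hosted on .gov domains that mention the exact phrase.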
2. Useful sites for finding datasets for Data Analysis tasks

Image by Author
Google search is great, but there are also dedicated sites hosting good-quality datasets. This article lists five such sites with detailed video instructions on how to access them. Do not worry; I have left out the common ones like the UCI Machine Learning Repository, Kaggle datasets, and Data.gov and instead provided you with some of the lesser-known ones.
Link: Useful sites for finding datasets for Data Analysis tasks
3. Five Real-world datasets for honing your Exploratory Data Analysis skills

Real-world datasets
If you want to dive right into analysis without searching for datasets, this article will be helpful. I have listed five datasets that are ideal for doing some good EDA and visualization. You can analyze the salary dataset, the clinical trials report, or even air traffic data. The icing on the cake is that all of them are available on Kaggle, so you only need to spin up a notebook to get started.
Link: 5 Real-World datasets for honing your Exploratory Data Analysis skills
4. Creating custom image datasets

Image by Author
If you are into deep learning and want to work on a project using your own datasets, this article shares five browser extensions that make it pretty easy to bulk-download images. However, be sure not to download any image in violation of its copyright terms.
5. Extracting data from HTML tables

Image by Author
Sometimes datasets available on the internet are presented in the form of HTML tables. Such tables are often long and spread across the whole webpage. The data may also be dynamic, i.e., updated at regular intervals, so copy-pasting it into an Excel sheet is not always useful. Scraping is an alternative, but there is an even more straightforward way. Google Sheets has a convenient function called IMPORTHTML, which is ideal for importing data from a table or list within an HTML page. This article describes the end-to-end process of fetching tables (and lists) into Google Sheets.
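For reference, IMPORTHTML takes three arguments: the page URL, the kind of element to import ("table" or "list"), and the 1-based position of that element on the page. The small Python helper below only assembles the formula text to paste into a cell; the helper name `importhtml_formula` and the example URL are hypothetical.

```python
def importhtml_formula(url, query="table", index=1):
    """Build the text of a Google Sheets IMPORTHTML formula."""
    if query not in ("table", "list"):
        raise ValueError('query must be "table" or "list"')
    if index < 1:
        raise ValueError("index is 1-based: the first table or list is 1")
    safe_url = url.replace('"', '""')  # Sheets escapes a quote by doubling it
    return f'=IMPORTHTML("{safe_url}", "{query}", {index})'


# Hypothetical page; replace with the URL that holds the table you need.
formula = importhtml_formula("https://example.com/stats", "table", 1)
print(formula)
```

Pasting the printed formula into a cell asks Sheets to fetch the first table on that page.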
6. Extracting data from PDFs

Image by Author
Extracting tabular data from PDFs is hard. An even bigger problem is that a lot of open data is available only as PDF files. This open data is crucial for analysis and for getting vital insights, yet accessing it becomes a bottleneck. In this article, I discuss Camelot, an open-source Python library that can help you extract tables from PDFs easily. It also has a web interface called Excalibur for people who do not want to code but still want to use the library's features.
Link: Extracting tabular data from PDFs made easy with Camelot
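A minimal sketch of what a Camelot call can look like, assuming the library is installed (it is a third-party package) and that the PDF contains text-based tables. The wrapper name `extract_tables` and the file name in the usage comment are hypothetical; see the linked article for the full walkthrough.

```python
def extract_tables(pdf_path, pages="1"):
    """Return the tables Camelot finds on the given pages as DataFrames."""
    import camelot  # third-party; imported lazily so this module loads without it

    tables = camelot.read_pdf(pdf_path, pages=pages)  # returns a TableList
    # Each detected table exposes its contents as a pandas DataFrame via .df
    return [tables[i].df for i in range(len(tables))]


# Example usage (hypothetical file name):
# frames = extract_tables("report.pdf", pages="1,2")
# frames[0].to_csv("table_1.csv", index=False)
```

Wrapping the call in a function keeps the third-party import in one place and makes it easy to loop over many PDFs.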
7. Extracting information from XML files
We have learnt to handle data in the form of HTML tables and PDF files. Another category is data stored in XML files, which must be processed before it can be used. XML stands for Extensible Markup Language. As the name suggests, it is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. In this article, I lay down the steps to convert XML data into an analysis-ready CSV file that can be ingested into the pandas library for further analysis.
Link: Extracting information from XML files into a Pandas dataframe
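The general idea of flattening XML into CSV can be sketched with Python's standard library alone. The sample document, its tag names, and the helper names below are made up for illustration; the linked article may follow different steps. The sketch assumes each record is a flat element whose attributes and child elements become columns.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample data: each <book> element is one record.
XML_TEXT = """<catalog>
  <book id="b1"><title>Alpha</title><price>10.5</price></book>
  <book id="b2"><title>Beta</title><price>7.0</price></book>
</catalog>"""


def xml_to_rows(xml_text, record_tag):
    """Turn every <record_tag> element into a dict of attributes and child text."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        row = dict(record.attrib)  # attributes become columns
        for child in record:
            row[child.tag] = (child.text or "").strip()  # child tags become columns
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the row dicts to CSV text, keeping first-seen column order."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(XML_TEXT, "book")
csv_text = rows_to_csv(rows)
print(csv_text)
```

The resulting CSV text can be written to a file and then loaded with pandas' `read_csv`.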
8. Reading data from clipboard into pandas dataframe

Image by Author
This article is about a very interesting pandas function, read_clipboard(), which creates a dataframe from data copied to the clipboard. It reads text from the clipboard and passes it to read_csv(), which then returns a parsed DataFrame object.

Reading data from clipboard into pandas dataframe | Image by Author
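Because read_clipboard() needs a real system clipboard, the sketch below imitates it instead: it passes a string standing in for the copied text to read_csv(), using the whitespace separator that read_clipboard() uses by default. The helper name `frame_from_copied_text` and the sample data are hypothetical.

```python
import io

import pandas as pd


def frame_from_copied_text(text, sep=r"\s+"):
    """Parse text the way read_clipboard() would after reading the clipboard."""
    return pd.read_csv(io.StringIO(text), sep=sep)


# Stand-in for a small table copied from a spreadsheet or web page.
copied = "name score\nana 90\nben 85\n"
df = frame_from_copied_text(copied)
print(df)

# With data actually on the clipboard, the direct call is simply:
# df = pd.read_clipboard()
```

Keyword arguments given to read_clipboard() are forwarded to read_csv(), so options such as `sep` or `header` can be adjusted in the same call.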
Conclusion
This article covers multiple techniques for obtaining datasets. Some of them should come in handy when you venture out to find datasets for your next project. Alternatively, you could create your own datasets and perform meaningful analysis on them. The sky is the limit!
Article originally posted here. Reposted with permission.