- By creating, capturing, and curating data, you can practice “data creationism” and get creative in making your own dataset.
- While Iris and Titanic are well-known datasets available to experiment with machine learning and data science, challenge yourself to create your own dataset. Anything can be data.
- Libraries like Beautiful Soup, Pandas, Traces, Seaborn, NumPy, Faker, Mechanical Soup, and others can be used to transform unstructured data into usable formats.
We live in a world of data. Everything from our daily commutes to the sites that we visit creates records upon records of information that goes largely unused. At the ODSC East 2018 Conference, however, Wealthsimple Data Engineer Max Humber pushed for “data creationism,” or curating, creating, and capturing information to make your own dataset.
Humber spends much of his time writing Python scripts and sifting through messy numbers. Day to day, he builds regressions and gives meaning to the data at Wealthsimple, a Toronto-based FinTech startup that aims to automate and simplify investing for younger generations. Before moving into this role a year ago, Humber was a data scientist at a credit company called Borrowell, where he built credit models and deployment services.
In his talk, however, Humber took an unconventional approach: he advocated using original datasets rather than popular machine learning data, and he provided several concrete methods and libraries to adopt.
“If you are building a new algorithm, giving a talk like this, or trying to teach someone about data science or machine learning, you have a choice. You can use these canned datasets like Titanic or the Digits Dataset…or you can build out your own bespoke data,” said Humber. “We will be using bespoke data.”
The question now is – how do you begin to make your own dataset?
Creating Data: From Data Structure to Visualization
To analyze a dataset thoroughly, much thought is needed to organize and insert the information in a queryable way. Humber walked through each step, from creating the Pandas DataFrame, a two-dimensional data structure much like a table, to using the Faker Python library to generate sample data for the DataFrame. He then explained how to insert the data from the script into an SQLite database and introduced two more libraries, Altair and NumPy, for regression analysis and visualization.
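The generate-then-store step described above can be sketched with the standard library alone. This is a minimal illustration, not Humber's code: `random` stands in for Faker, plain tuples stand in for the DataFrame, and the `people` table, its columns, and the value pools are all hypothetical.

```python
import random
import sqlite3

# Hypothetical value pools; a library like Faker would supply richer fake values.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]
CITIES = ["Toronto", "Boston", "Montreal"]


def make_rows(n, seed=0):
    """Return n synthetic (name, city, age) tuples, reproducible via seed."""
    rng = random.Random(seed)
    return [
        (rng.choice(FIRST_NAMES), rng.choice(CITIES), rng.randint(18, 65))
        for _ in range(n)
    ]


def load_into_sqlite(rows):
    """Create an in-memory SQLite table and insert the rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, city TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


rows = make_rows(100)
conn = load_into_sqlite(rows)

# Once the data is in SQLite it can be queried like any other table.
summary = conn.execute(
    "SELECT city, COUNT(*), AVG(age) FROM people GROUP BY city ORDER BY city"
).fetchall()
for city, count, avg_age in summary:
    print(f"{city}: {count} people, average age {avg_age:.1f}")
```

Fixing the seed keeps the synthetic data reproducible, which helps when the same made-up dataset is reused across a talk or tutorial.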
Curating Data: Digging through the Dataset
Fortunately, for anyone looking for new data to use, the multitude of untouched datasets provides a wide selection to explore. To show the curation process, Humber amused the crowd by taking his own Kindle highlights and using Markovify to build a Markov chain, a system where the next output depends on the previous states, that generates quotes.
Curation, however, is not always as easy as gathering independent files. Unused datasets are commonly published online, and it is the responsibility of a data scientist to curate and make use of them. Humber extended the Markov chain quote generator example to work with quotes on Goodreads. Using the Beautiful Soup HTML parser, he walked through the process of finding classes and tags, cleaning the resulting inner text with regular expressions, and then moving this data into a centralized data structure for his own purposes. This example shows how creatively existing datasets can be curated for new uses.
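To make the Markov chain idea concrete, here is a small hand-rolled sketch. It is not Markovify and not Humber's code: it is a first-order, word-level chain written with the standard library, and the `HIGHLIGHTS` sentences are hypothetical stand-ins for real Kindle highlights.

```python
import random
from collections import defaultdict

# Hypothetical stand-ins for exported Kindle highlights.
HIGHLIGHTS = [
    "data is everywhere if you look for it",
    "you can build your own data",
    "look for data in your own habits",
]

START, END = "<START>", "<END>"


def build_chain(sentences):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_words=20):
    """Walk the chain from START, picking each next word at random."""
    word = START
    output = []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)


chain = build_chain(HIGHLIGHTS)
quote = generate(chain, random.Random(42))
print(quote)
```

Each generated word depends only on the word before it, so the output recombines fragments of the source sentences into new ones. A longer history per state generally produces more coherent text at the cost of less variety.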
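The find, clean, and collect workflow described above can be sketched as follows. To stay dependency-free, this sketch swaps in the standard library's `html.parser` for Beautiful Soup (where `soup.find_all("div", class_=...)` would do the tag-finding step in one call). The HTML sample and the `quoteText` and `authorOrTitle` class names are assumptions for illustration, not a description of any real page.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup; real pages will differ.
SAMPLE_HTML = """
<div class="quoteText">
  &ldquo;Anything   can be
  data.&rdquo;
  <span class="authorOrTitle">Example Author</span>
</div>
<div class="sidebar">Not a quote</div>
<div class="quoteText">&ldquo;Stray from the expected.&rdquo;
  <span class="authorOrTitle">Another Author</span></div>
"""


class QuoteParser(HTMLParser):
    """Collect the inner text of every <div class="quoteText">."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside a target div (0 = outside)
        self.parts = []       # text fragments of the div being read
        self.raw_blocks = []  # finished inner-text strings

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # Assumes no void tags such as <br> inside the target div.
            self.depth += 1
        elif tag == "div" and "quoteText" in classes:
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.raw_blocks.append("".join(self.parts))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def clean(raw):
    """Collapse whitespace, then split the text into quote and author."""
    text = re.sub(r"\s+", " ", raw).strip()
    match = re.match(r"“(?P<quote>.+)”\s*(?P<author>.*)", text)
    return match.groupdict() if match else None


parser = QuoteParser()
parser.feed(SAMPLE_HTML)
parser.close()

# Centralized structure: one dict per quote, ready for a table or DataFrame.
records = [r for r in (clean(raw) for raw in parser.raw_blocks) if r]
print(records)
```

A list of dictionaries is a convenient landing format because it converts directly into rows of a table, whichever tool consumes it next.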
Capturing Data: Quantifying the Unquantifiable
What if you could build models of the things that you do day to day? Humber presented this possibility by capturing data on his own habits, from the time he spends in his apartment to his biking habits and even the blood alcohol concentration of his friends, using spreadsheets, calculations, and scripts.
With the help of libraries like Traces, a library that allows datetime objects to serve as dictionary keys, and Mechanical Soup, Humber explained how one can quantify any aspect of one’s life and insert the newly collected data into a Pandas DataFrame. He showed how a Pandas DataFrame can then serve as the basis of a dataset of your own.
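The idea of keying observations by datetime can be illustrated with a plain dictionary. This is only a sketch under stated assumptions: it uses the standard library rather than Traces, and the `log` entries, the state names, and both helper functions are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical log: each key is the moment a new state began.
log = {
    datetime(2018, 5, 1, 8, 0): "home",
    datetime(2018, 5, 1, 9, 0): "away",
    datetime(2018, 5, 1, 18, 30): "home",
    datetime(2018, 5, 1, 23, 0): "away",
}


def state_at(log, moment):
    """Return the most recent state at or before moment, or None."""
    times = sorted(log)
    index = bisect_right(times, moment)
    return log[times[index - 1]] if index else None


def time_in_state(log, state, end):
    """Total time spent in state; assumes end is after the last log entry."""
    times = sorted(log)
    total = timedelta()
    for start, stop in zip(times, times[1:] + [end]):
        if log[start] == state:
            total += stop - start
    return total


noon_state = state_at(log, datetime(2018, 5, 1, 12, 0))
home_time = time_in_state(log, "home", end=datetime(2018, 5, 2, 0, 0))
print(noon_state, home_time)
```

Recording only the moments when something changes keeps the log small, and the value at any other time is recovered by looking up the latest entry before it.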
Overall, Humber’s talk presented a challenge when dealing with data: stray from the expected. While data is collected at every corner of our lives, even more items are quantifiable around us, and Humber encouraged creativity in finding the intersections where we can create, curate, and capture these datasets.