Whatever you want to call it (data wrangling, data munging, or data transformation), the part of the data science process that sits between data acquisition and exploratory data analysis (EDA) is one of the core skills a data scientist must have. It covers the tasks you perform to understand your data and prepare it for machine learning. It is often said that data wrangling consumes a sizable share of the time a data scientist spends on a project, with figures as high as 75% commonly reported.
Many dismiss data wrangling as ordinary custodial work, but done properly, it leads to accurate insights drawn from valuable enterprise data assets. The first step, however, is to make sure your data wrangling skills are up to snuff. In this article, I discuss some of the make-or-break data wrangling skills you’ll need for successful data science projects.
The Importance of Good Data Wrangling Skills
A good data wrangler knows how to integrate information from multiple data sources, solve common transformation problems, and resolve data cleansing and quality issues. A data wrangler also knows their data intimately and is always looking for ways to enrich it. Unlike what’s taught in your local data science camp or specialization program, real-world data is rarely flawless, especially when you work with constantly evolving technology. This means you have to know the business context for the data well enough to interpret it, clean it, and transform it into an ingestible form. It may sound easy, but frustratingly, it is not!
Data wrangling skills are so integral to the job that many leading tech companies ask data science candidates to perform a series of data transformations, including merging, ordering, and aggregation, using a data science language such as R, Python, Julia, or even SQL, on a specific data set designed to demonstrate their capabilities in this area. This way, hiring managers can assess the candidate’s methodology and thought process, and how well the candidate can make reasoned judgements based on the underlying business context.
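To make the merging, ordering, and aggregation tasks concrete, here is a minimal sketch in plain Python. The customer and order records, the field names, and the helper functions are all hypothetical and invented for illustration; a real exercise would more likely use a data frame library or SQL on a much larger data set.

```python
from collections import defaultdict

# Hypothetical sample data: customers and their orders.
customers = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": "West"},
    {"customer_id": 3, "region": "East"},
]
orders = [
    {"order_id": 101, "customer_id": 1, "amount": 250.0},
    {"order_id": 102, "customer_id": 2, "amount": 100.0},
    {"order_id": 103, "customer_id": 1, "amount": 50.0},
    {"order_id": 104, "customer_id": 3, "amount": 75.0},
]


def merge_orders(orders, customers):
    """Merge: inner-join each order to its customer on customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**order, **by_id[order["customer_id"]]}
        for order in orders
        if order["customer_id"] in by_id
    ]


def total_by_region(rows):
    """Aggregate: sum the order amounts for each region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def order_by_total(totals):
    """Order: list (region, total) pairs, largest total first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


merged = merge_orders(orders, customers)
ranked = order_by_total(total_by_region(merged))
print(ranked)
```

Even in a toy case like this, there are judgement calls an interviewer might probe: whether orders without a matching customer should be dropped (as the inner join here does) or kept and flagged depends on the business context.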
Without solid data wrangling skills, the rest of the data science process simply can’t progress in any meaningful way. Data scientists may try to get by with the barest effort in data wrangling, but they’ll quickly find they have little idea what to look for in their data sets. Yes, data wrangling takes a lot of time and effort, but it’s worth it in the end. An important goal in building excellent data wrangling skills is keeping your efforts efficient and consistent.
How to Approach Data Wrangling
Over time, data scientists will develop a code toolbox of commonly used data wrangling tasks so that when the occasion arises, they can just dip into their box of tricks to solve the problem at hand. My own data wrangling toolbox never stops growing as I encounter new requirements and situations. A lot of my toolbox deals with date handling and imputing missing values.
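A toolbox entry of this kind might look like the following sketch, which covers the two areas just mentioned: parsing dates that arrive in several formats and imputing missing values. The function names, the list of accepted formats, and the choice of the median as the fill value are assumptions made for illustration, not a prescribed method.

```python
from datetime import date, datetime
from statistics import median

# Assumed set of accepted formats; extend it as new representations turn up.
# The order matters: the first format that matches wins.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to base an estimate on, so return the values unchanged.
        return list(values)
    fill = median(observed)
    return [fill if v is None else v for v in values]


print(parse_date("03/07/2023"))
print(impute_median([10, None, 30, None, 20]))
```

Note that a string such as 03/07/2023 is ambiguous between month-first and day-first conventions. The sketch assumes month-first, but the right reading is something only the business context can settle.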
Aside from hand-coding solutions, there are a number of products that can kick-start the process without coding. Products such as Trifacta and Datawatch Monarch have proven quite popular.
Six Core Activities
Trifacta is a leading developer of data wrangling software for data exploration and self-service data preparation that doesn’t involve coding. The company offers a compelling list of six core data wrangling activities:
- Discovering – includes some of the EDA steps in the data science process, i.e. getting to know your data in terms of patterns and correlations. You’ll often work with a domain expert here.
- Structuring – since data comes in all shapes and sizes, you’ll need to be able to merge, order, and reshape the data to be suitable for machine learning.
- Cleaning – enterprise data is often dirty and inconsistent. Missing data values will affect the accuracy of your models. Date values can cause particular frustrations due to the many ways of representing dates in a database.
- Enriching – how can you derive data from what you already have? For instance, if you have a business address in your data set, for machine learning and data visualization purposes, it would be helpful to supplement the address with longitude and latitude values.
- Validating – validation is really the next step after cleaning: take a deeper look at the data values to make sure they make sense both statistically and in the correct business context.
- Publishing – after completing the wrangling, you’ll need to integrate all the individual steps in a “data pipeline” so when the data set needs to be refreshed, you can simply re-run the pipeline and execute all the data wrangling tasks at once. You should fully document the steps so you won’t forget the decisions you made along the way.
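As a rough illustration of how several of these activities fit together, the sketch below chains cleaning, enriching, and validating into a small re-runnable pipeline in plain Python. Everything in it is hypothetical: the field names, the validation rule, and especially the CITY_COORDS lookup table, which holds approximate coordinates and stands in for a real geocoding service.

```python
# Hypothetical lookup with approximate coordinates; it stands in for
# a real geocoding service.
CITY_COORDS = {"Boston": (42.36, -71.06), "Denver": (39.74, -104.99)}


def clean(rows):
    """Cleaning: trim whitespace, normalize casing, drop rows with no city."""
    cleaned = []
    for row in rows:
        city = (row.get("city") or "").strip().title()
        if city:
            cleaned.append({**row, "city": city})
    return cleaned


def enrich(rows):
    """Enriching: add latitude and longitude (None when the city is unknown)."""
    enriched = []
    for row in rows:
        lat, lon = CITY_COORDS.get(row["city"], (None, None))
        enriched.append({**row, "lat": lat, "lon": lon})
    return enriched


def validate(rows):
    """Validating: fail fast on values the business context rules out."""
    for row in rows:
        if row["revenue"] < 0:
            raise ValueError(f"negative revenue in row: {row}")
    return rows


# Publishing: the ordered list of steps is the documented, re-runnable pipeline.
PIPELINE = [clean, enrich, validate]


def run_pipeline(rows, steps=PIPELINE):
    """Apply each wrangling step in order and return the final rows."""
    for step in steps:
        rows = step(rows)
    return rows


raw = [
    {"city": "  boston ", "revenue": 1200.0},
    {"city": None, "revenue": 300.0},
    {"city": "DENVER", "revenue": 950.0},
]
result = run_pipeline(raw)
print(result)
```

Keeping each activity in its own small function means the docstrings double as a record of the decisions made along the way, and refreshing the data set comes down to calling run_pipeline again on the new input.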
Data wrangling may not be as high profile as other steps in the data science process, such as model selection, but that doesn’t make it any less important. Many projects live or die by how well this step is completed.