Part 2: The Data Science Ecosystem, Data Wrangling
There was a money quote from Michael Cavaretta, a data scientist at Ford Motors, in a recent article in the NY Times. The piece was about the challenges data scientists face going about their daily business. Cavaretta said: “We really need better tools so we can spend less time on data wrangling and get to the sexy stuff.” Data wrangling is cleaning data, connecting tools, and getting data into a usable format; the sexy stuff is predictive analysis and modeling. Considering the first is sometimes referred to as “janitor work,” you can guess which one is a bit more enjoyable.
In our recent survey, we found that data scientists spent a solid 80% of their time wrangling data. Given how expensive of a resource data scientists are, it’s surprising there are not more companies in this space.
In our last section, I noted how structured databases were originally built for finance and operations while unstructured ones were pushed forward by data scientists. I see a similar thing happening in this space. Since structured databases are an older industry, there were already myriad tools available for operations and finance people who have always worked with data. But there are also a new class of tools designed specifically for data scientists who have many of the same problems, but often need additional flexibility.
We’ll start with an area I know well.
Data enrichment improves raw data. Original data sources can be messy, in different formats, from multiple applications (and so on) which makes running predictive analysis on it difficult, if not impossible. Enrichment cleans that data so data scientists don’t have to.
I’ve broken this category into “human” and “automated,” but both approaches involve both people and machines. Human data enrichment means taking every row of your data set and having a human being transform it, but this requires a lot of computer automation to keep it reliable. Likewise, automated data enrichment involves setting up rules and scripts to transform data but requires a human to set up and check those rules.
Human enrichment relies on the fact that there are tasks people are simply better at than machines. Take image classification, for example. Humans can easily tell if a satellite photo contains clouds. Machines still struggle to consistently do that.
Language is another big use case for human data enrichment. Natural language processing algorithms can do amazing things, but they can’t spot sarcasm or irony or slang nearly as well as a person can. You’ll often see PR firms and marketers analyze sentiment this way.
Human-enriched daa can also be used to train search algorithms, and people can read and collect disparate information better than a machine can. Again, this requires the tasks to be well set up, and for the software to contain quality control safeguards, but if you get thousands of people working in tandem on simple jobs that people do better than machines, you can enrich tons of data at impressive speeds. Our company, CrowdFlower, is in this space, but so are others likeWorkFusion and in some ways, Amazon Mechanical Turk.
Automated enrichment has the same goals, but works with scripts and having machines transform raw data into usable data, instead of people. As I mentioned above, you still need a smart data scientist inputting that information and checking it when enrichment is complete, but automated enrichment can be incredibly powerful if all the i’s are dotted. Data with small errors and inconsistencies can be transformed into usable data near instantaneously with the right scripting.
Notably, automated solutions work well for cleaning data that doesn’t need a human eye. Examples range from simple tasks like name and date formatting to more complicated ones like dynamically pulling in meta data from the internet. Trifacta, Tamr, Paxata, and Pentaho come to mind as great automated solutions, but this is a space I see growing quickly as companies rush in to give data scientists some of their valuable time back.
ETL stands for Extract, Transform, and Load and the name gets to the heart of what the tools in this section of our ecosystem do. Essentially, what ETL/Blending solutions do for data scientists is take dissimilar data sources and marry them so analysis can be done.
Here’s an example of what I’m talking about. Say you have a financial database that contains a list of your customers, how much they pay, and what they buy. That lives in one place. Now say you have a differentdatabase containing each customer’s geographical business address. The companies in this space help combine that data into a single, usable database, so a data scientist can look into questions like which regions buy the most of a certain product, which parts of the country are your target market, etc.
And this is just a simple example; they can get much more complex. But essentially every data scientist will need to do some blending in their day-to-day responsibilities. Multiple data sources are frequently all formatted differently and, if you wanted a holistic view of a client or your enterprise at large, you’d need to blend these sources together to do deep analysis.
Alteryx, Astera, CloverETL, and etleap all have software that can handle this sort of data blending. And though ETL has been around since the days of structured databases, it figures to become increasingly vital. After all: more data sources means more discordant formatting. The promise of big data rests on being able to get both a granular and bird’s eye view of any of this information, for whatever analysis needs doing.
Data integration solutions overlap significantly with ETL/Blending software. Companies in both spaces aim to integrate data, but data integration is more concerned with unifying dataapplications and specific formats (as opposed to working on generic sets of data).
Think of what I mentioned last time, how there are third-party cloud applications that take care of everything from sales and marketing data to social reach and email campaigns. How do you combine each application into a usable data set on which a data scientist can run predictive analysis? With software like ClearStory or Databricks or SnapLogic.
Informatica has been in the space for years and does over a billion dollars of revenue. They also do quite a bit of work in each category of data wrangling as I’ve defined it here. Microsoft actually has two large offerings that would fit in this category: Azure Data Factory and SQL Server Integration Services.
Much like the ETL/blending tools, data integration programs are mainly focused on combining data from the left side of our ecosystem so it can be modeled by software on the right. In other words, integration tools like Apataror Zoomdata and the like allow you to marry data from cloud applications likeHootsuite or Gainsight so you can get BI from Domo or Chartio.
Lastly, let’s talk about API connectors. These companies don’t focus so much on transforming data as they do on integrating with as many separate APIs as possible. When companies like these started forming, I don’t think many of us predicted how big this space would actually be.
But these can be really, really powerful tools in the right hands. To start with a fairly non-technical example, I thinkIFTTT is a great way to understand what happens with an API connector. IFTTT (which stands for “if this, then that”) allows someone who posts an Instagram picture to save it immediately to their Dropbox or post it on Twitter. You can think of it as an API connector that a non-data scientist uses to stay on top of their online persona. But it’s important to include here because a lot of data scientists I talk to use it as a lightweight tool for personal applications and for work.
Zapier is like IFTTT but focused on being a lightweight connector for business applications, which may make it more relevant for many data science teams.
MuleSoft, meanwhile, connects all of your business applications. Say a user logs onto your site. Who needs to know about it? Does your sales team need the lead? Does your nurture team need to know that user is back again? How about marketing? Do they want to know their email campaign is working? A single API connector can trigger all these actions simultaneously.
Lastly, Segment.io connects your product to many of the SaaS business applications on the left of the inforgraphic and more.
API connectors simply don’t exist without the abundance of tools in this ecosystem to actually connect to. And while they weren’t totally designed for data scientists, data scientists use them, especially in conjunction with blending and integration tools.
OPEN SOURCE TOOLS
There are far fewer open-source data wrangling tools than data stores or in the analytics space. Google open-sourced their very interesting open-refine project. For the most part we see companies building their own ad-hoc tools mainly in Python, though Kettle is open-source ETL tool with some traction.