In this blog post, I introduce my OSDC presentation, scheduled for November 2nd at 10:40 PST. I talk to a lot of data scientists and am involved in a startup (Tamr, Inc.) that applies machine learning (ML) to the challenge of integrating multiple independently constructed data sets. I have never met an enterprise data scientist who claims to spend less than 80% of his/her time finding data sets of interest, integrating them and cleaning the result. Basically, a data scientist has a project at hand and needs to find a data set (or sets) to analyze to “solve” it. Invariably there is more than one data set, and they must be merged. Since I might be Mike Stonebraker in one data set and M.R. Stonebreaker in a second, this merging must be performed without the benefit of primary keys. Also, one should generally figure that 10% of the values in any data set are incorrect or missing. In the presentation, I will explain why data integration is so hard and suggest constructive steps that an enterprise can take to make it easier.
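To make the "no primary keys" problem concrete, here is a minimal sketch of approximate name matching using only Python's standard library. It is an illustration, not a description of how Tamr or any production system works: the helper names (`normalize`, `similarity`, `match_records`) and the 0.7 threshold are assumptions chosen for the example, and real integration work needs far more than string similarity.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a name and turn periods into spaces so 'M.R.' becomes 'm r'."""
    return " ".join(name.lower().replace(".", " ").split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_records(left, right, threshold=0.7):
    """Pair each name in `left` with its most similar name in `right`.

    A pair is kept only if its score reaches `threshold` (an arbitrary,
    hypothetical cutoff for this sketch).
    """
    matches = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        score = similarity(a, best)
        if score >= threshold:
            matches.append((a, best, score))
    return matches


if __name__ == "__main__":
    # The two spellings from the paragraph above, plus an unrelated name.
    for a, b, score in match_records(
        ["Mike Stonebraker"], ["Jane Doe", "M.R. Stonebreaker"]
    ):
        print(f"{a!r} ~ {b!r} (score {score:.2f})")
```

Even this toy version shows the difficulty: the threshold is a judgment call, and every candidate pair has to be scored, which does not scale to large data sets without further techniques.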
I then turn to the second leg of the stool, which is visualization. I claim that a powerful visualization system is a great complement to an analysis package. An analysis tool is a powerful arrow in the quiver of a data scientist. However, it assumes you know what you want to analyze. For example, if you know you want to discover the relationship between employees’ ages and their salaries, then you can run a regression on the appropriate data. However, suppose you are not sure what is important about the data. In this case the real query is “tell me something interesting”. Correlating everything with everything may not be a useful first step. Instead, I argue that one should present the data using a powerful visualization tool, because one’s eyeballs are a really powerful pattern detector. Hence, I argue that a data scientist should have both tools in his/her arsenal.
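The age-versus-salary case is the easy one, because the question is already known. As a minimal sketch, here is an ordinary least-squares line fit in plain Python. The function name `fit_line` is hypothetical, and the ages and salaries are made-up numbers chosen only to keep the arithmetic simple.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Illustrative, invented data: not drawn from any real data set.
    ages = [25, 35, 45, 55]
    salaries = [50000, 60000, 70000, 80000]
    slope, intercept = fit_line(ages, salaries)
    print(f"salary ~ {slope:.1f} * age + {intercept:.1f}")
```

Note what the code presupposes: someone already decided that age and salary are the two columns worth relating. Nothing here helps with “tell me something interesting”.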
The third leg of the stool concerns analysis. Rather than talk about specific analysis techniques, I present two pieces of “pragma” concerning machine learning, which I believe everyone should take to heart. For example, deep learning is all the rage these days, but I argue that it is far from a universal approach to ML problems.
Finally, many data scientists use files to store the data sets for a project. I explain why I think this is a particularly bad idea. Instead, data scientists should become facile with DBMS technology.
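For readers who have only ever worked with files, here is a minimal sketch of what keeping a project's data in a DBMS can look like, using the `sqlite3` module that ships with Python. It does not reproduce the arguments of the talk. The table layout, the helper names and the three sample rows are all invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up employees table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE employees (
            id     INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,
            age    INTEGER CHECK (age > 0),  -- the schema rejects impossible ages
            salary REAL                      -- NULL marks a missing value
        )
        """
    )
    rows = [("Ann", 30, 55000.0), ("Bob", 45, 70000.0), ("Cy", 52, None)]
    conn.executemany(
        "INSERT INTO employees (name, age, salary) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn


def count_missing_salaries(conn):
    """Count rows whose salary is missing (NULL)."""
    query = "SELECT COUNT(*) FROM employees WHERE salary IS NULL"
    return conn.execute(query).fetchone()[0]


def average_salary(conn, min_age):
    """Average salary of employees at least `min_age` old; AVG skips NULLs."""
    query = "SELECT AVG(salary) FROM employees WHERE age >= ?"
    return conn.execute(query, (min_age,)).fetchone()[0]


if __name__ == "__main__":
    conn = build_demo_db()
    print("missing salaries:", count_missing_salaries(conn))
    print("average salary, age 40+:", average_salary(conn, 40))
```

The schema states what a valid row is, missing values are explicit, and questions are asked declaratively instead of through ad hoc parsing code.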