Searching for a Data Colander for Automatic Data Cleaning

Editor’s note: The following is an article written by Devavrat Shah of MIT and Christina Lee Yu of Cornell University. Be sure to check out their presentation at ODSC East 2019, “Predictions in Excel through Estimating Missing Values.”

The stated mission of Data Science or Data-Driven AI is to extract insights from data and use them to make better decisions.

Every day, every hour, and every minute, numerous data analysts and scientists carefully comb through first-party and third-party data via Excel(-like) interfaces to work towards achieving this mission. It is a serious understatement to characterize this task as not easy.

To begin, this requires getting data or information in one place. This is a BIG challenge because data needs to be pulled from various systems and sources; some are easy and some are not. This invariably requires help from other parts of the organization, beyond the control of a modest data analyst or data science team.

[Related article: ODSC East 2019 Sneak Peek: Insights from 19 Data Science Experts]

The next step is to understand the data. This involves slicing-n-dicing aspects of the data, using charting and visualization tools to learn from it, and then iterating on this interaction to make sure that the right aspects are understood. This results in the first useful outcome – meaningful dashboards and reports.

The following step is to use our understanding of the data to build “models” that can help predict unknown or missing data, or forecast what’s coming in the future. This task requires team effort – sandboxing is easy, but making models useful in a “production” setting is challenging. To make matters worse, predictions or forecasts on their own are amusing but of little business value unless put to work – at best, they are the penultimate step in the process.
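To make the missing-data step concrete, here is a minimal sketch of the simplest possible “model” for filling gaps in a numeric spreadsheet-style table: estimating each missing cell from the observed values in its column. The table and values here are made up for illustration; real approaches (including the matrix-estimation methods behind the workshop) are considerably more sophisticated.

```python
import numpy as np

# Toy table with missing entries (np.nan marks the gaps), e.g. a few
# weekly figures where some cells were never recorded.
data = np.array([
    [10.0, 12.0, np.nan],
    [ 9.0, np.nan, 14.0],
    [np.nan, 11.0, 13.0],
])

# Simplest baseline "model": estimate each missing cell by the mean of
# the observed entries in its column.
col_means = np.nanmean(data, axis=0)
filled = np.where(np.isnan(data), col_means, data)

print(filled)
```

Even this crude baseline illustrates the shape of the task: a predictive model turns an incomplete table into a complete one, which downstream analysis can then consume.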

This brings us to the last step of decision making. In this day and age, we are told the fairy tale of Data Science or AI that everything will be automated and we will not need to do anything. Other than a few exceptions, we strongly disagree with this “black-n-white” view of the world. We believe that Data Science or AI is going to help us make decisions by providing tools and frameworks to streamline the process. In that sense, decision making involves evaluating various scenarios using the learned predictive models, and subsequently making the right decision subject to various constraints and with an eye on the right objective.

This entire mission of Data Science relies on having access to “clean” and “trustable” data. Most, if not all, of the time, this is not the case. To begin with, there are always typos. Then, there are duplicates for many different reasons. The data is not uniform – December 1, 2011 and 12/01/2011 mean the same to you and me, but not to the machine; the same challenge holds for equating “Denim” and “Jeans”, or “Yes” and “yes”, and the list goes on. Data schemas have names that are not uniform across organizations or, to make matters worse, are not even understandable within the same organization or across time. Finally, in many places, information is missing due to data entry errors or incomplete datasets.
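The date and casing examples above can be sketched in a few lines of normalization code. This is an illustrative sketch only – the format list and helper names are our own assumptions, not a complete cleaning pipeline.

```python
from datetime import datetime

# Hypothetical raw cells as they might appear in a spreadsheet column.
raw_dates = ["December 1, 2011", "12/01/2011", "2011-12-01"]
raw_flags = ["Yes", "yes", "YES "]

def normalize_date(s: str) -> str:
    """Try a few common formats and emit one canonical ISO date."""
    for fmt in ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(s.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {s!r}")

def normalize_flag(s: str) -> str:
    """Collapse case and whitespace variants to one canonical spelling."""
    return s.strip().lower()

print([normalize_date(d) for d in raw_dates])  # all collapse to "2011-12-01"
print({normalize_flag(f) for f in raw_flags})  # collapses to {"yes"}
```

The hard part in practice is that no fixed list of formats or synonyms (“Denim” vs. “Jeans”) covers real data, which is why a hand-written script like this one never fully solves the problem – and why an automatic data colander is needed.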

This means that decisions based on insights extracted from such data can be very misleading. To avoid such pitfalls, Data Analysts and Scientists end up spending massive amounts of their time manually fixing such errors. It is mind-numbing. There are definitely better ways to utilize this time and talent. This is exactly what Data Science, as a discipline, needs to do.

In short, we are in dire need of a wonder data colander or strainer that can work with an Excel(-like) environment and clean the data up (to the extent possible) automatically.

Learn more at the “Predictions in Excel through Estimating Missing Values” workshop on May 3, 11:00am -12:30pm at ODSC East 2019.


More on the authors:

Devavrat Shah is a Professor in the department of Electrical Engineering and Computer Science at Massachusetts Institute of Technology. His current research interests are at the interface of Statistical Inference and Social Data Processing. His work has been recognized through prize paper awards in Machine Learning, Operations Research, and Computer Science, as well as career prizes including the 2010 Erlang Prize from the INFORMS Applied Probability Society and the 2008 ACM Sigmetrics Rising Star Award. He is a distinguished young alumnus of his alma mater, IIT Bombay.





Christina Lee Yu is an Assistant Professor at Cornell University in Operations Research and Information Engineering. Prior to Cornell, she was a postdoc at Microsoft Research New England. She received her PhD and MS in Electrical Engineering and Computer Science from Massachusetts Institute of Technology, and her BS in Computer Science from California Institute of Technology. She received an honorable mention for the 2018 INFORMS Dantzig Dissertation Award. Her research focuses on designing and analyzing scalable algorithms for processing social data based on principles from statistical inference.




ODSC gathers the attendees, presenters, and companies that are shaping the present and future of data science and AI. ODSC hosts one of the largest gatherings of professional data scientists with major conferences in USA, Europe, and Asia.