Combining data sets can be a huge pain, with problems both obvious and insidious. Aaron will present practical approaches for detecting and avoiding potential pitfalls, as well as rigorous, repeatable processes for generating merge tables by reducing the merge problem to de-duplication. The focus will be on techniques for quickly achieving high accuracy on data sets of moderate size, with brief excursions into the entity resolution literature, machine learning for distance metrics, and the application of clustering and visualization techniques, including multidimensional scaling.
Aaron is a Data Science Instructor at Metis. Prior to Metis, he was with Booz Allen Hamilton, where he helped government clients make effective use of data. His work has ranged from building visualization prototypes with the UK’s National Health Service and NYU’s GovLab to winning the Arlington Public Schools Big Data Roundtable by building predictive models for student outcomes. Aaron’s academic background is in math: after his BS at the University of Wisconsin-Madison, his MAT Mathematics thesis at Bard College demonstrated a novel construction of quaternion and octonion integers. Since then, Aaron has enjoyed working on more directly practical problems. Always passionate about education, Aaron first taught data science by that name in 2013; his students have gone on to work at companies including Airbnb, Infochimps, and Netflix.