Bayesian Estimation, Group Comparison, and Workflow
Over the past year, having learned about Bayesian inference methods, I finally see how estimation, group comparison, and model checking build upon each other into this really elegant framework for data analysis. Parameter Estimation: The foundation of this is “estimating a parameter”. In a typical situation, we are most... Read more
How Well Did Data Scientists Predict the 2018 World Cup? (Hint: Not Very)
This year’s World Cup in Russia was the most watched sporting event in history. GlobalWebIndex reports that up to 3.4 billion people – around half of the world’s population – watched some part of the tournament. As with past World Cups, a global prediction market emerged allowing spectators to... Read more
The Best Mario Kart Character According to Data Science
Mario Kart was a staple of my childhood — my friends and I would spend hours after school as Mario, Luigi, and other characters from the Nintendo universe racing around cartoonish tracks and lobbing pixelated bananas at each other. One thing that always vexed our little group of would-be speedsters was... Read more
Starting a Data Science Project
I spoke in a webinar this past Saturday about how to get into Data Science. One of the questions asked was “What does a typical day look like?” I think there is a big opportunity to explain what really happens before any machine learning takes place for a large... Read more
Predicting the Truncated xorshift32* Random Number Generator
Software programmers need random number generators. For this purpose, they often use functions with outputs that appear random. Gerstmann has a nice post about Better C++ Pseudo Random Number Generator. He investigates the following generator: uint32_t xorshift(uint64_t *m_seed) { uint64_t result = *m_seed * 0xd989bcacc137dcd5ull; *m_seed ^= *m_seed >> 11;... Read more
How Quickly Can You Compute the Dot Product Between Two Large Vectors?
A dot (or scalar) product is a fairly simple operation that sums the element-wise products: float sum = 0; for (size_t i = 0; i < len; i++) { sum += x1[i] * x2[i]; } return sum; It is nevertheless tremendously important. You know these fancy machine learning... Read more
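For readers who want to try the loop from the excerpt above, here is a minimal sketch of it as a complete C function. The function name `dot_product` and its signature are assumptions made for illustration, and this is the plain scalar loop only, not any of the faster variants the full post may discuss.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions):
 * returns the dot product of two float arrays, each holding `len` elements. */
float dot_product(const float *x1, const float *x2, size_t len) {
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++) {
        /* Multiply matching elements and accumulate the result. */
        sum += x1[i] * x2[i];
    }
    return sum;
}
```

Calling `dot_product(a, b, 3)` with `a = {1, 2, 3}` and `b = {4, 5, 6}` computes 1*4 + 2*5 + 3*6. For long vectors, accumulating in a single `float` can lose precision, so a wider accumulator may be worth considering.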
Data Processing on Modern Hardware
If you had to design a new database system optimized for the hardware we have today, how would you do it? And what is the new hardware you should care about? This was the topic of a seminar I attended last week in Germany at Dagstuhl. Here are some thoughts:... Read more
How To Create Data Products That Are Magical Using Sequence-to-Sequence Models
A tutorial on how to summarize text and generate features from Github Issues using deep learning with Keras and TensorFlow. Teaser: Training a model to summarize Github Issues. Predictions are in rectangular boxes. The above results are randomly selected elements of a holdout set. Keep reading below, there will be a link to many more... Read more
Watermain Breaks in the City of Toronto
It has been a while since my last post due to the major transition of moving back to Canada. This post will be a bit shorter than my previous ones but hopefully it will give some insight on practically investigating and analyzing open data that are becoming more popular... Read more
Plotting author statistics for Git repos using Git of Theseus
I spent a few days during the holidays fixing up a bunch of semi-dormant open source projects and I have a couple of blog posts in the pipeline about various updates. First up, I made a number of fixes to Git of Theseus which is a tool (written in Python) that... Read more
