Starting a Data Science Project
I spoke at a webinar this past Saturday about how to get into Data Science. One of the questions asked was “What does a typical day look like?” I think there is a big opportunity to explain what really happens before any machine learning takes place... Read more
Predicting the Truncated xorshift32* Random Number Generator
Software programmers need random number generators. For this purpose, they often use functions with outputs that appear random. Gerstmann has a nice post about Better C++ Pseudo Random Number Generator. He investigates the following generator: uint32_t xorshift(uint64_t *m_seed) { uint64_t result = *m_seed * 0xd989bcacc137dcd5ull; *m_seed ^=... Read more
How Quickly Can You Compute the Dot Product Between Two Large Vectors?
A dot (or scalar) product is a fairly simple operation that sums the element-wise products of two vectors: float sum = 0; for (size_t i = 0; i < len; i++) { sum += x1[i] * x2[i]; } return sum; It is nevertheless tremendously important. You know these... Read more
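For readers who want to try the loop from the excerpt above, here is a minimal sketch of it as a standalone C function. The function name `dot_product` and the `const float *` parameter types are assumptions made for illustration; only `x1`, `x2`, `len`, and `sum` come from the excerpt, and this plain scalar version is not necessarily the variant the post goes on to benchmark.

```c
#include <stddef.h>

/* Scalar dot product: multiply matching elements and accumulate the results.
 * The name and signature are illustrative assumptions, not taken from the post. */
float dot_product(const float *x1, const float *x2, size_t len) {
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++) {
        sum += x1[i] * x2[i];
    }
    return sum;
}
```

Calling it on two arrays of equal length returns a single `float`; an empty input (`len` of 0) returns 0.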
Data Processing on Modern Hardware
If you had to design a new database system optimized for the hardware we have today, how would you do it? And what is the new hardware you should care about? This was the topic of a seminar I attended last week in Germany at Dagstuhl. Here... Read more
How To Create Data Products That Are Magical Using Sequence-to-Sequence Models
A tutorial on how to summarize text and generate features from Github Issues using deep learning with Keras and TensorFlow. Teaser: Training a model to summarize Github Issues. Predictions are in rectangular boxes. The above results are randomly selected elements of a holdout set. Keep reading below, there will be a link... Read more
Watermain Breaks in the City of Toronto
It has been a while since my last post due to the major transition of moving back to Canada. This post will be a bit shorter than my previous ones but hopefully it will give some insight on practically investigating and analyzing open data that are... Read more
Plotting author statistics for Git repos using Git of Theseus
I spent a few days during the holidays fixing up a bunch of semi-dormant open source projects, and I have a couple of blog posts in the pipeline about various updates. First up, I made a number of fixes to Git of Theseus, which is a tool (written... Read more
Some things I’d like you to know about Data Science
Things I’ve learned, mostly by making mistakes. Masses of data + cutting edge machine learning + cheap compute = Profit. Right? It’s not that simple. Data science isn’t a replacement for asking difficult questions and doing hard work based on the answers. In fact, it’s quite the... Read more
Big aggregate queries can still violate privacy
Suppose you want to prevent your data science team from being able to find out information on individual customers, but you do want them to be able to get overall statistics. So you implement two policies. Data scientists can only query aggregate statistics, such as counts... Read more
Business Analytics: Requirements for Data Transformation
There is a major change happening in the IT industry — the use of big data and analytics to guide how businesses are run. Many companies are embracing analytics as part of their core strategies. Unfortunately, some of those companies think that they can just purchase... Read more