Distinguishing between Statistical Modeling and Machine Learning

Editor’s note: This article serves as a great overview. After reading it, we recommend listening to the podcast at the bottom; it may broaden your understanding. If you are looking for one, here is a framework to distinguish statistical modeling from machine learning, based on the desire for interpretability. In summary, if you […]

A Budget of Classifier Evaluation Measures

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?” My concrete advice is: Read Nina Zumel’s excellent series on scoring classifiers. Keep notes. Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you […]
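The excerpt's advice is to settle on a small number of metrics, preferring AUC early on. As a minimal sketch (assuming Python with scikit-learn, which the excerpt itself does not specify), here is how AUC is typically computed from model scores:

```python
# Hedged sketch: scoring a classifier with AUC (area under the ROC curve),
# the metric the excerpt suggests favoring early in a project.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # known class labels
y_score = [0.1, 0.4, 0.35, 0.8]  # model scores (higher = more likely positive)

# AUC is the probability a random positive outscores a random negative.
auc = roc_auc_score(y_true, y_score)
print(round(auc, 2))  # prints 0.75
```

AUC's appeal early in a project is that it evaluates the score's ranking quality without committing to a classification threshold.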

Kaggle FIFA Stats

The phenomenon of the “yearly sports game release” is a well-established tradition in the video game industry. The biggest is, perhaps, the FIFA franchise, which has reigned supreme in its niche, simulated soccer, for most of its more than twenty-year history. EA Sports released the latest iteration, FIFA 17, a few weeks ago to the usual […]

Stats Can’t Make Modeling Decisions

Here’s a question that appeared recently on the Reddit statistics forum: If the effect sizes of coefficients are really small, can you interpret that as no relationship? The coefficients are very significant, which is expected with my large dataset. But the coefficients are tiny (0.0000001). Can I conclude there is no relationship? Or must I say there is a relationship, but […]
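The questioner's situation, where a tiny coefficient is nonetheless "very significant", falls out of the arithmetic of large samples. A minimal sketch (hypothetical numbers, standard t-statistic for a correlation, not taken from the post) shows why:

```python
# Hedged sketch: at large n, even a negligible correlation is
# "statistically significant" despite being practically meaningless.
import math

r = 0.01         # a tiny correlation: explains only 0.01% of variance
n = 1_000_000    # a large dataset, as in the questioner's situation

# t-statistic for testing r != 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
print(round(t, 1))  # t is about 10, far beyond any usual significance cutoff
```

Significance measures only whether the effect is distinguishable from zero given the sample size; it says nothing about whether the effect is large enough to matter, which is a modeling decision, not a statistical one.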

Improved vtreat Documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here). vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, […]

Probability is hard: part 4

This is the fourth part of a series of posts about conditional probability and Bayesian statistics. In the first article, I presented the Red Dice problem, which is a warm-up problem that might help us make sense of the other problems. In the second article, I presented the problem of interpreting medical tests when there is uncertainty about […]

Probability is hard: part three

This is the third part of a series of posts about conditional probability and Bayesian statistics. In the first article, I presented the Red Dice problem, which is a warm-up problem that might help us make sense of the other problems. In the second article, I presented the problem of interpreting medical tests when there is uncertainty […]

7 Ways I Got Trapped by Statistical Randomness

Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere — if a process is truly random, then it is not predictable, in the analytic sense of that term. Randomness refers to the absence of patterns, order, coherence, and predictability in a system. Unfortunately, we are often fooled by random […]

Probability is hard, part two

If you read the previous post, you know that my colleague Sanjoy Mahajan and I have been working on a series of problems related to conditional probability and Bayesian statistics.  In the previous article, I presented the Red Dice problem, which is relatively simple.  I posted it here because it presents four different versions of the […]