A Budget of Classifier Evaluation Measures

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?” My concrete advice is: Read Nina Zumel’s excellent series on scoring classifiers. Keep notes. Settle on one or two metrics as you move from project to project. We prefer “AUC” early in a project (when you […]

Kaggle FIFA Stats

The “yearly sports game release” is a well-established tradition in the videogame industry. The biggest is, perhaps, the FIFA franchise, the reigning leader in its niche, simulated soccer, for most of its more than twenty years of history. EA Sports released the latest iteration, FIFA 17, a few weeks ago to the usual […]

Stats Can’t Make Modeling Decisions

Here’s a question that appeared recently on the Reddit statistics forum: If effect sizes of coefficient are really small, can you interpret as no relationship? Coefficients are very significant, which is expected with my large dataset. But coefficients are tiny (0.0000001). Can I conclude no relationship? Or must I say there is a relationship, but […]

Improved vtreat Documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here). vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, […]

Probability is hard: part 4

This is the fourth part of a series of posts about conditional probability and Bayesian statistics. In the first article, I presented the Red Dice problem, which is a warm-up problem that might help us make sense of the other problems. In the second article, I presented the problem of interpreting medical tests when there is uncertainty about […]

Probability is hard: part three

This is the third part of a series of posts about conditional probability and Bayesian statistics. In the first article, I presented the Red Dice problem, which is a warm-up problem that might help us make sense of the other problems. In the second article, I presented the problem of interpreting medical tests when there is uncertainty […]

7 Ways I Got Trapped by Statistical Randomness

Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere: if a process is truly random, then it is not predictable, in the analytic sense of that term. Randomness refers to the absence of patterns, order, coherence, and predictability in a system. Unfortunately, we are often fooled by random […]

Probability is hard, part two

If you read the previous post, you know that my colleague Sanjoy Mahajan and I have been working on a series of problems related to conditional probability and Bayesian statistics. In the previous article, I presented the Red Dice problem, which is relatively simple. I posted it here because it presents four different versions of the […]

Probability is hard

For more than a month, my colleague Sanjoy Mahajan and I have been banging our heads on a series of problems related to conditional probability and Bayesian statistics. We knew when we started that this material is tricky, as demonstrated by veridical paradoxes like the Monty Hall problem, the Girl Named Florida, and so on. […]