Another batch of Think Stats notebooks

Getting ready to teach Data Science in the spring, I am going back through Think Stats and updating the Jupyter notebooks.  When I am done, each chapter will have a notebook that shows the examples from the book along with some small exercises, with more substantial exercises at the end. If you are reading the […]

Statistics, Simians, the Scottish, and Sizing up Soothsayers

A predictive model can be a parametrized mathematical formula, or a complex deep learning network, but it can also be a talkative cab driver or a slides-wielding consultant. From a mathematical point of view, they are all trying to do the same thing, to predict what’s going to happen, so they can all be evaluated […]

More notebooks for Think Stats

As I mentioned in the previous post, I am getting ready to teach Data Science in the spring, so I am going back through Think Stats and updating the Jupyter notebooks. I am done with Chapters 1 through 6 now. If you are reading the book, you can get the notebooks by cloning this […]

New notebooks for Think Stats

Getting ready to teach Data Science in the spring, I am going back through Think Stats and updating the Jupyter notebooks.  When I am done, each chapter will have a notebook that shows the examples from the book along with some small exercises, with more substantial exercises at the end. If you are reading the […]

Distinguishing between Statistical Modeling and Machine Learning

Editor’s note: This article will serve as a great overview. After reading it, we recommend listening to the podcast at the bottom; it may just broaden your understanding. If you are looking for it, here is one framework to distinguish statistical modeling from machine learning, and it is based on the desire for interpretability. In summary, if you […]

A Budget of Classifier Evaluation Measures

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?” My concrete advice is: Read Nina Zumel’s excellent series on scoring classifiers. Keep notes. Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you […]

Kaggle FIFA Stats

The phenomenon of the “yearly sports game release” is a well-established tradition in the videogame industry. The biggest is, perhaps, the FIFA franchise, reigning supreme leader in its niche, simulated soccer, for most of its more than twenty-year history. EA Sports released the latest iteration, FIFA 17, a few weeks ago to the usual […]

Stats Can’t Make Modeling Decisions

Here’s a question that appeared recently on the Reddit statistics forum: If effect sizes of coefficient are really small, can you interpret as no relationship?  Coefficients are very significant, which is expected with my large dataset. But coefficients are tiny (0.0000001). Can I conclude no relationship? Or must I say there is a relationship, but […]

Improved vtreat Documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here). vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, […]