John Mount

Consulting Algorithmist/Researcher

Bio: My specialty is analysis and design of algorithms, with an emphasis on efficient implementation. I work to find applications of state-of-the-art methods in optimization, statistics, and machine learning in various application areas. Currently co-authoring "Practical Data Science with R".

On indexing operators and composition

In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some […]
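The claim that indexing is an instance of function composition can be illustrated with a minimal sketch. The helper name `index` and the sample vectors are hypothetical, chosen only for illustration, and are not taken from the article: indexing `a` by `p` and then by `q` gives the same result as indexing `a` once by the composed index `p[q]`.

```python
def index(values, positions):
    """Vectorized indexing: return [values[i] for i in positions].

    Hypothetical helper for illustration; it mimics a[p] for an
    integer index vector p (0-based, as in Python).
    """
    return [values[i] for i in positions]


a = ["w", "x", "y", "z"]
p = [3, 0, 2, 1]   # a permutation of the positions of a
q = [1, 1, 2]      # an arbitrary (repeating) selection from positions of p

# Index twice: first by p, then by q.
two_steps = index(index(a, p), q)

# Index once by the composed index p[q].
one_step = index(a, index(p, q))

print(two_steps == one_step)
```

Treating an index vector as a function from positions to positions, the equality above is associativity of composition: (a∘p)∘q = a∘(p∘q).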

Teaching pivot / un-pivot

Co-written by John Mount and Nina Zumel. Introduction: In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering”) is easy […]
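As a rough sketch of the two operations named above, the following plain-Python example moves a small wide table into thin entity/attribute/value form and back. The function names `unpivot` and `pivot`, the column names `key` and `value`, and the sample data are all assumptions made for illustration; they are not the authors' code or any package's API.

```python
def unpivot(rows, id_col, value_cols):
    """Wide -> thin: emit one (id, key, value) record per measured cell."""
    return [
        {id_col: row[id_col], "key": col, "value": row[col]}
        for row in rows
        for col in value_cols
    ]


def pivot(thin_rows, id_col):
    """Thin -> wide: collect all key/value pairs sharing an id into one row.

    Assumes each (id, key) pair appears at most once.
    """
    wide = {}
    for rec in thin_rows:
        row = wide.setdefault(rec[id_col], {id_col: rec[id_col]})
        row[rec["key"]] = rec["value"]
    return list(wide.values())


wide_table = [
    {"id": 1, "height": 10, "weight": 20},
    {"id": 2, "height": 11, "weight": 22},
]

thin_table = unpivot(wide_table, "id", ["height", "weight"])
round_trip = pivot(thin_table, "id")

print(len(thin_table))
print(round_trip == wide_table)
```

In this sketch un-pivoting needs only the list of columns to stack, while pivoting has to regroup records by id, which is one way to see why the two directions differ in difficulty.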

Why do Decision Trees Work?

In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal […]

A Budget of Classifier Evaluation Measures

Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?” My concrete advice is: Read Nina Zumel’s excellent series on scoring classifiers. Keep notes. Settle on one or two metrics as you move project to project. We prefer “AUC” early in a project (when you […]

Laplace Noising Versus Simulated Out of Sample Methods (cross frames)

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, […]

Databases In Containers

A great number of readers reacted very positively to Nina Zumel’s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool. In her case the tools were the data manipulation grammars SQL […]

sample(): The “Monkey’s Paw” Style

The R functions base::sample and base::sample.int include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due […]

Improved vtreat Documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here). vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, […]

For Loops Can Lose Class Information

Did you know R’s for() loop control structure drops class annotations from vectors? Consider the following R code demonstrating three uses of a for-loop that one would expect to behave very similarly. Notice that in the third for loop the di print as numbers. This is because running through the dates in this way loses […]