A Budget of Classifier Evaluation Measures

Beginning analysts and data scientists often ask: “How does one remember and master the seemingly endless number of classifier metrics?” My concrete advice is: Read Nina Zumel’s excellent series on scoring classifiers. Keep notes. Settle on one or two metrics as you move from project to project. We prefer “AUC” early in a project (when you […]

Beyond One-Hot: An Exploration of Categorical Variables

In machine learning, data is king. The algorithms and models used to make predictions with the data are important, and very interesting, but ML is still subject to the idea of garbage-in-garbage-out. With that in mind, let’s look at a little subset of those input data: categorical variables. Categorical variables (wiki) are those that represent a […]

Better Python Compressed Persistence in joblib

Problem setting: persistence for big data. Joblib is a powerful Python package for managing computation: parallel computing, caching, and primitives for out-of-core computing. It is handy when working with so-called big data, which can consume more than the available RAM (several GB nowadays). In such situations, objects in the working space must be […]

Diving Deep into Python, the not-so-obvious Language Parts

Sections: The C3 class resolution algorithm for multiple class inheritance; Assignment operators and lists – simple-add vs. add-AND operators; True and False in the datetime module; Python reuses objects for small integers – use “==” for equality, “is” for identity; And to illustrate the test for equality (==) vs. identity (is): Shallow vs. deep […]
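
As a minimal sketch of the equality-versus-identity point mentioned in this excerpt (the variable names are illustrative and not taken from the original post), lists are used here so the result does not depend on how the interpreter caches small integers:

```python
# Equality (==) compares values; identity (is) checks whether two names
# refer to the very same object. All names below are hypothetical.
original = [1, 2, 3]
copy_of_original = list(original)   # a new list object with equal contents
alias = original                    # a second name for the same object

values_equal = original == copy_of_original   # True: same contents
same_object = original is copy_of_original    # False: two distinct objects
alias_same = alias is original                # True: one object, two names

print(values_equal, same_object, alias_same)
```

In short, reach for “==” when comparing values and reserve “is” for checks such as `x is None`.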

Introduction to Python

I’ve been trying to learn how to program since I was ten years old. I tried many times – mostly because my dad is a developer and wanted to share the thing he loves – but Java, C, and C++ always looked scary. I couldn’t really get into it. There was too much I had […]

Processing the Language of Pitchfork Part 2: Word Count

In the second part of this three-part ODSC series on analyzing Pitchfork album reviews, we’ll introduce the Natural Language Toolkit library to discover patterns, trends, and other interesting things hidden in the words of album reviews. For this article I found the most commonly used words and adjectives/adverbs in my collection of 17,000 reviews. I also […]

Approximate Nearest News

As you may know, one of my (very geeky) interests is approximate nearest neighbor methods, and I’m the author of a Python package called Annoy. I’ve also built a benchmark suite called ann-benchmarks to compare different packages. Annoy was the world’s fastest package for a few months, but two things happened. FALCONN (FAst Lookups of Cosine […]

Introduction to Flask as a Micro-framework

For those of you who are not familiar with it, Flask is a web development framework written in Python. To understand how to use Flask, let’s first consider the definition of a framework. Def: Framework := A framework in coding is a set of classes, functions, and variables that form a mindset for thinking about […]