Here’s What Twitter Was Like During the Super Bowl.

The Patriots’ 34-28 win in Super Bowl 51 was, quite possibly, one of the greatest football games of all time. It had the largest Super Bowl comeback of all time and was the first ever to go to overtime. For data-minded folks, the game exhibited striking parallels to the election. ESPN’s live prediction models were saying that Atlanta was almost certain to […]

Implementing a Principal Component Analysis (PCA) in Python, step by step

Sections: Introduction; Principal Component Analysis (PCA) vs. Multiple Discriminant Analysis (MDA); What is a “good” subspace?; Summarizing the PCA approach; Generating some 3-dimensional sample data; Why are we choosing a 3-dimensional sample?; 1. Taking the whole dataset ignoring the class labels; 2. Computing the d-dimensional mean vector; 3. a) Computing the Scatter Matrix; 3. […]

A Budget of Classifier Evaluation Measures

Beginning analysts and data scientists often ask: “How does one remember and master the seemingly endless number of classifier metrics?” My concrete advice is: Read Nina Zumel’s excellent series on scoring classifiers. Keep notes. Settle on one or two metrics as you move from project to project. We prefer “AUC” early in a project (when you […]

Beyond One-Hot: An Exploration of Categorical Variables

In machine learning, data is king. The algorithms and models used to make predictions with the data are important, and very interesting, but ML is still subject to the idea of garbage-in-garbage-out. With that in mind, let’s look at a little subset of those input data: categorical variables. Categorical variables (wiki) are those that represent a […]
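For readers who have not met the encoding named in the title, here is a minimal sketch of one-hot encoding in plain Python. It is only an illustration, not the article's own code: the helper name `one_hot` and the sample `colors` list are hypothetical, and a real project would more likely use a library encoder.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows


# Hypothetical sample data, used only for illustration.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
print(categories)
print(encoded)
```

Each row contains exactly one 1, so the number of columns grows with the number of distinct categories.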

Better Python Compressed Persistence in joblib

Problem setting: persistence for big data. Joblib is a powerful Python package for managing computation: parallel computing, caching, and primitives for out-of-core computing. It is handy when working on so-called big data, which can consume more than the available RAM (several GB nowadays). In such situations, objects in the working space must be […]

Diving Deep into Python, the not-so-obvious Language Parts

Sections: The C3 class resolution algorithm for multiple class inheritance; Assignment operators and lists – simple-add vs. add-AND operators; True and False in the datetime module; Python reuses objects for small integers – use “==” for equality, “is” for identity; And to illustrate the test for equality (==) vs. identity (is); Shallow vs. deep […]
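The equality-versus-identity distinction mentioned in the outline can be sketched in a few lines. This is an illustrative example rather than the article's own code; the variable names are hypothetical. It uses lists rather than small integers because object reuse for small integers is an implementation detail and should not be relied on.

```python
# Two separate list objects that happen to hold the same values.
a = [1, 2, 3]
b = [1, 2, 3]
# A second name bound to the very same object as `a`.
c = a

same_value = a == b    # compares contents
same_object = a is b   # compares object identity
alias_is_same = c is a # `c` and `a` name one object

print(same_value, same_object, alias_is_same)

# Mutating through the alias is visible through the original name,
# because both names refer to one list.
c.append(4)
print(a)
```

In short, `==` asks whether two objects have equal values, while `is` asks whether two names refer to the same object.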

Introduction to Python

I’ve been trying to learn how to program since I was ten years old. I tried many times – mostly because my dad is a developer and wanted to share the thing he loves – but Java, C, and C++ always looked scary. I couldn’t really get into it. There was too much I had […]

Processing the Language of Pitchfork Part 2: Word Count

In the second part of this three-part ODSC series on analyzing Pitchfork album reviews, we’ll introduce the Natural Language Toolkit library to discover patterns, trends, and other interesting things hidden in the words of album reviews. For this article I found the most commonly used words and adjectives/adverbs in my collection of 17,000 reviews. I also […]