Git-pandas is a Python library I wrote to make analysis of git data easier when working with collections of repositories. It simplifies a lot of useful analyses, like cumulative blame plots, but those can be slow, especially across many large repositories. In the past we’ve worked around that by running analyses offline, and […]

In the past I’ve posted about the various categorical encoding methods one can use for machine learning tasks, like one-hot, ordinal, or binary encoding. In my OSS package, category_encoders, I’ve added a single scikit-learn-compatible encoder called BaseNEncoder, which lets the user pick a base (2 for binary, N for ordinal, 1 for one-hot, […]
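The excerpt cuts off, but the core idea behind base-N encoding can be sketched in plain Python. This is an illustration of the technique, not category_encoders' actual implementation: each category's ordinal index is written out as a fixed number of base-N digits, and each digit becomes a column.

```python
def base_n_digits(ordinal, base, width):
    """Represent a category's ordinal index as `width` base-`base` digits.

    With base=2 this reproduces binary encoding: four categories need
    only two digit columns instead of four one-hot columns.
    """
    digits = []
    for _ in range(width):
        digits.append(ordinal % base)
        ordinal //= base
    return digits[::-1]  # most-significant digit first

# Binary (base 2) encoding of four categories in two columns:
categories = ["red", "green", "blue", "yellow"]
encoded = {c: base_n_digits(i, base=2, width=2) for i, c in enumerate(categories)}
# "red" -> [0, 0], "green" -> [0, 1], "blue" -> [1, 0], "yellow" -> [1, 1]
```

Raising the base trades fewer columns for denser (less one-hot-like) columns, which is exactly the knob BaseNEncoder exposes.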

A couple of months ago I posted an overview of simple estimation of hierarchical events using Python and petersburg. At the time it probably seemed a bit trivial: just building a structured frequency model and drawing samples from it. But I have finally implemented the next step to complete the intended functionality. This post […]

I’ve just released version 1.0.0 of category_encoders on PyPI; you can check out the source here: https://github.com/wdm0006/categorical_encoding In two previous posts (here and here), we discussed and examined the differences between encoding methods for categorical variables. It turns out they are all a bit different and make different assumptions, so you end up with […]

In the previous post about categorical encoding we explored different methods for converting categorical variables into numeric features. In this post, we will explore another method: feature hashing. Feature hashing, or the hashing trick, is a method for turning arbitrary features into a sparse binary vector. It can be extremely efficient by having a standalone hash […]
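The excerpt ends mid-sentence, but the hashing trick it describes can be sketched with the standard library alone. This is a minimal illustration, not any particular library's implementation: each feature value is hashed to one of a fixed number of buckets, and the presence of that bucket becomes a binary indicator.

```python
import hashlib

def hash_features(tokens, n_buckets=16):
    """Map arbitrary string features into a sparse binary vector.

    Returns a dict of {bucket_index: 1}, the sparse representation of a
    length-`n_buckets` binary vector. Uses md5 (rather than Python's
    built-in hash) so bucket assignments are stable across runs.
    """
    vec = {}
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % n_buckets] = 1
    return vec

features = hash_features(["color=red", "size=large", "color=red"], n_buckets=8)
```

Note the two practical consequences visible even in this sketch: the output dimensionality is fixed regardless of how many distinct feature values appear, and two different values can collide into the same bucket.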

In machine learning, data is king. The algorithms and models used to make predictions with the data are important, and very interesting, but ML is still subject to garbage-in, garbage-out. With that in mind, let’s look at one subset of input data: categorical variables. Categorical variables (wiki) are those that represent a […]

Oftentimes when making some kind of uncertain decision, the decision maker will use a measure such as expected value to make that decision. Imagine the case of a single coin flip where the bettor pays 5 dollars to play, and gets 2 dollars for heads and 10 dollars for tails. The expected value of this […]
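The excerpt cuts off before the arithmetic, but the expected value of the bet as stated can be worked out directly, assuming a fair coin:

```python
cost = 5.0                                   # price to play
payoffs = {"heads": 2.0, "tails": 10.0}      # payout per outcome
p = 0.5                                      # fair coin: each outcome equally likely

expected_payout = p * payoffs["heads"] + p * payoffs["tails"]  # 0.5*2 + 0.5*10 = 6.0
expected_value = expected_payout - cost                        # 6.0 - 5.0 = 1.0
```

So a decision maker going purely by expected value would take the bet: on average it returns one dollar more than it costs, even though a heads outcome loses money.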
