Will McGinnis

Senior Architect - Predikto

Bio: I like building systems. I have a background in Mechanical Engineering, but work primarily in software, especially large-scale, data-intensive systems for things like optimization, numeric analysis, machine learning, and inference. I like Python, Elasticsearch, Apache Spark, machine learning, and solving actual problems.

Git-Pandas Caching for Faster Analysis

Git-pandas is a Python library I wrote to make analysis of git data easier when dealing with collections of repositories. It makes a ton of cool things easier, like cumulative blame plots, but those can be slow, especially with many large repositories. In the past we've worked around that by running analyses offline, and […]
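The core idea behind that kind of caching can be sketched with the standard library alone. This is a minimal illustration, not git-pandas' actual cache API: `cumulative_blame` here is a hypothetical stand-in for an expensive repository traversal, memoized so repeated calls with the same repo and revision are free.

```python
import functools

call_count = 0  # track how often the expensive work actually runs

@functools.lru_cache(maxsize=None)
def cumulative_blame(repo_path, rev):
    """Hypothetical stand-in for an expensive git history traversal."""
    global call_count
    call_count += 1
    return {"repo": repo_path, "rev": rev, "loc": 1234}

first = cumulative_blame("myrepo", "abc123")
second = cumulative_blame("myrepo", "abc123")  # served from the cache
```

Because the result is keyed on `(repo_path, rev)`, re-plotting the same history never repeats the traversal; only new commits trigger new work.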

BaseN Encoding and Grid Search in Category_Encoders

In the past I’ve posted about the various categorical encoding methods one can use for machine learning tasks, like one-hot, ordinal, or binary encoding. In my OSS package, category_encoders, I’ve added a single scikit-learn compatible encoder called BaseNEncoder, which allows the user to pick a base (2 for binary, N for ordinal, 1 for one-hot, […]
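The idea of base-N encoding can be shown without the library: take each category's ordinal index and write it out as fixed-width base-N digits. This is just a sketch of the concept; the real BaseNEncoder is a scikit-learn transformer and its API differs from this toy function.

```python
def base_n_digits(value, base, width):
    """Represent an integer ordinal as a fixed-width list of base-N digits."""
    digits = []
    for _ in range(width):
        digits.append(value % base)
        value //= base
    return digits[::-1]

categories = ["red", "green", "blue", "yellow"]
ordinal = {c: i for i, c in enumerate(categories)}
# base=2 gives binary encoding: 4 categories fit in 2 digit columns
encoded = {c: base_n_digits(ordinal[c], base=2, width=2) for c in categories}
```

With base 2 this is binary encoding (log₂ columns); cranking the base up toward the number of categories collapses it toward a single ordinal column, which is exactly the knob the post describes grid-searching over.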

Mixed-mode Estimation in Petersburg

A couple of months ago I posted an overview of simple estimation of hierarchical events using Python and petersburg. At the time it probably seemed a little trivial: just building a structured frequency model and drawing samples from it. But I have finally implemented the next step to complete the intended functionality. This post […]

Beyond One-hot: Sklearn Transformers and Pip Release

I’ve just released version 1.0.0 of category_encoders on PyPI; you can check out the source here: https://github.com/wdm0006/categorical_encoding In two previous posts (here and here), we discussed and examined the differences between encoding methods for categorical variables. It turns out they are all a bit different and make different assumptions, and so you end up with […]

Even Further Beyond One-hot: Feature Hashing

In the previous post about categorical encoding we explored different methods for converting categorical variables into numeric features. In this post, we will explore another method: feature hashing. Feature hashing, or the hashing trick, is a method for turning arbitrary features into a sparse binary vector. It can be extremely efficient by having a standalone hash […]
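The hashing trick described above can be sketched in a few lines: hash each token with a stable hash function and set the corresponding bucket in a fixed-width binary vector. This is a toy illustration of the idea, not scikit-learn's `FeatureHasher`; the bucket count and hash choice here are arbitrary.

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    """Hashing trick: map arbitrary tokens into a fixed-width binary vector."""
    vec = [0] * n_buckets
    for tok in tokens:
        # md5 gives a stable hash, so the encoding is reproducible across runs
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % n_buckets] = 1
    return vec

v = hash_features(["red", "large"])
```

Note that no dictionary of seen categories is kept, which is the efficiency win, at the cost of possible collisions when two tokens land in the same bucket.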

Beyond One-Hot: An Exploration of Categorical Variables

In machine learning, data is king. The algorithms and models used to make predictions with the data are important, and very interesting, but ML is still subject to the idea of garbage-in-garbage-out. With that in mind, let’s look at one small subset of that input data: categorical variables. Categorical variables (wiki) are those that represent a […]
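As a baseline for the encodings the post compares, one-hot encoding itself fits in a couple of lines: one column per category, with exactly one position set. A minimal sketch:

```python
def one_hot(value, categories):
    """One column per category; exactly one position is 1."""
    return [1 if c == value else 0 for c in categories]

colors = ["red", "green", "blue"]
one_hot("green", colors)  # -> [0, 1, 0]
```

The cost is obvious from the sketch: the vector width grows linearly with the number of distinct categories, which is what motivates the alternative encodings explored in this series.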

Decision Strategies: Beyond Expected Value

Oftentimes when making some kind of uncertain decision, the decision maker will use a measure such as expected value to make that decision. Imagine the case of a single coin flip where the bettor pays 5 dollars to play, and gets 2 dollars for heads and 10 dollars for tails. The expected value of this […]
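The expected value of the coin-flip game above is simple arithmetic, worth making explicit before the post moves beyond it:

```python
# Coin flip: pay 5 dollars to play, win 2 dollars on heads or 10 on tails
cost = 5.0
payoffs = {"heads": 2.0, "tails": 10.0}

# EV = 0.5 * 2 + 0.5 * 10 - 5 = 1.0 dollars per play
expected_value = sum(0.5 * p for p in payoffs.values()) - cost
```

A positive expected value suggests playing, but as the post goes on to argue, expected value alone ignores variance and the bettor's risk tolerance.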
