Vector Models in Machine learning Part 2

This is a blog post rewritten from a presentation at NYC Machine Learning on Sep 17. It covers Annoy, a library I built for nearest neighbor queries in high-dimensional spaces. In the first part, I went through some examples of why vector models are useful. In the second […]

Introduction to Trainspotting

This was originally posted on the Silicon Valley Data Science blog. At Silicon Valley Data Science, we have a slight obsession with the Caltrain. Our interest stems from the fact that half of our employees rely on the Caltrain to get to work each day. We also want to give back to the community, and […]

Beyond One-hot: Sklearn Transformers and Pip Release

I’ve just released version 1.0.0 of category_encoders on PyPI; you can check out the source here: https://github.com/wdm0006/categorical_encoding. In two previous posts (here and here), we discussed and examined the differences between encoding methods for categorical variables. It turns out they are all a bit different and make different assumptions, and so you end up with […]

Tips for Debugging Code without F-Bombs – Part 1

Debugging code is a large part of actually writing code, yet unless you have a computer science background, you probably have never been exposed to a methodology for debugging code.  In this tutorial, I’m going to show you my basic method for debugging your code so that you don’t want to tear your hair out. […]

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning – Part 2

Bootstrapping and Uncertainties. Introduction: In the previous article (Part I), we introduced the general ideas behind model evaluation in supervised machine learning. We discussed the holdout method, which helps us deal with real-world limitations such as limited access to new, labeled data for model evaluation. Using the holdout method, we split our dataset […]
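
For readers who want a concrete picture of the holdout method mentioned in this excerpt, here is a minimal sketch using only the Python standard library. The function name `holdout_split`, the 30% test fraction, and the fixed seed are assumptions chosen for illustration; they are not taken from the article.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows once, then hold out a fraction of them for testing.

    Hypothetical helper: the name, default fraction, and seed are
    illustrative choices, not the article's.
    """
    rng = random.Random(seed)          # local RNG so the split is repeatable
    shuffled = list(rows)              # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train, test = holdout_split(range(10))
```

The model would be fit on `train` only and scored on `test`, so the score reflects data the model has not seen.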

Neo4j on IBM POWER8 – Bigger Graphs and Better Performance

Today’s IT business leaders focused on big data solutions face two big challenges: they need to manage massive volumes of data, and they need to rapidly generate insight from that data. Over time, we have observed that one key insight is the existing and new relationships found by more deeply analyzing raw […]

Resolution Tweet Breakdown

It’s a new year, everyone, which means New Year’s resolutions. Many of us are making the usual promises to ourselves about picking up or dropping certain habits. Thinking about New Year’s resolutions prompted us here at opendatascience.com to analyze people’s goals. While it’s impossible to know everyone’s resolution, the next best thing is what we can collect from […]

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning – Part 1

Introduction: Machine learning has become a central part of our lives, as consumers, customers, and hopefully as researchers and practitioners! Whether we are applying predictive modeling techniques to our research or business problems, I believe we have one thing in common: we want to make “good” predictions! Fitting a model to our training data […]

Even Further Beyond One-hot: Feature Hashing

In the previous post about categorical encoding, we explored different methods for converting categorical variables into numeric features. In this post, we will explore another method: feature hashing. Feature hashing, or the hashing trick, is a method for turning arbitrary features into a sparse binary vector. It can be extremely efficient by having a standalone hash […]
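
To make the idea in this excerpt concrete, here is a minimal sketch of the hashing trick in plain Python. It is an illustration under stated assumptions, not the post's implementation: the function name `hash_features`, the choice of MD5, and the 16-bucket width are all hypothetical, and the vector is returned as a dense list for readability even though real implementations usually store it sparsely.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map arbitrary string features to a fixed-length binary vector.

    Hypothetical helper: MD5 and n_buckets=16 are illustrative choices.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        # Hash the feature string to a large integer, deterministically.
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        # Fold the hash into one of n_buckets positions and switch that bit on.
        idx = int(digest, 16) % n_buckets
        vec[idx] = 1
    return vec


row = hash_features(["color=red", "size=large"])
```

Because the output width is fixed in advance, no vocabulary has to be stored; the trade-off is that two different features can collide in the same bucket.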