Beyond One-Hot: An Exploration of Categorical Variables

In machine learning, data is king. The algorithms and models used to make predictions with the data are important, and very interesting, but ML is still subject to the idea of garbage-in-garbage-out. With that in mind, let’s look at a little subset of those input data: categorical variables. Categorical variables (wiki) are those that represent a […]

Learning Reinforcement Learning (With Code, Exercises and Solutions)

Skip all the talk and go directly to the GitHub repo with code and exercises. Why study reinforcement learning? Reinforcement Learning is one of the fields I’m most excited about. Over the past few years, amazing results like learning to play Atari games from raw pixels and Mastering the Game of Go have gotten a […]

Ad Hoc Distributed Random Forests

When arrays and dataframes aren’t flexible enough. TL;DR: Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with scikit-learn to train and run a distributed random forest on distributed tabular NYC Taxi data. Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily. Motivation […]

Combining Human Knowledge with Machine Learning for Robust Data Flows

Even if you’re working with 100% machine-created data, you are most likely performing some manual inspection of your data, and of the output of your machine learning models, at different points in the data analysis process. Many companies, including Google, GoDaddy, Yahoo!, and LinkedIn, use what’s known as HITL, or Human-In-The-Loop, to improve the […]

Introducing Dask distributed #1

tl;dr: We analyze JSON data on a cluster using pure Python projects. Dask, a Python library for parallel computing, now works on clusters. During the past few months, others and I have extended dask with a new distributed memory scheduler. This enables dask’s existing parallel algorithms to scale across 10s to 100s of nodes, and extends a subset […]

12 Algorithms Every Data Scientist Should Know

Algorithms have become part of our daily lives and they can be found in almost any aspect of business. Gartner calls this the algorithmic business and it is changing the way we (should) run and manage our organizations. There are all kinds of algorithms and for each aspect of your business, there are different algorithms, which […]

Single-Layer Neural Networks and Gradient Descent

This article offers a brief glimpse of the history and basic concepts of machine learning. We will take a look at the first algorithmically described neural network and the gradient descent algorithm in the context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for […]

An Introduction to Contextual Bandits

In this post I discuss the Multi-Armed Bandit problem and its applications to feed personalization. First, I will use a simple synthetic example to visualize arm selection with bandit algorithms. I also evaluate the performance of some of the best-known algorithms on a dataset for musical genre recommendations. What is a Multi […]