Git-Pandas caching for Faster Analysis

Git-pandas is a Python library I wrote to make analysis of Git data easier when dealing with collections of repositories. It makes a ton of cool stuff easier, like cumulative blame plots, but these analyses can be kind of slow, especially with many large repositories. In the past we’ve made that work by running analyses offline, and […]

A Gentle Introduction to Recommender Systems with Implicit Feedback

Recommender systems have become a very important part of the retail, social networking, and entertainment industries. From providing advice on songs to try, to suggesting books to read, to finding clothes to buy, recommender systems have made it much easier for customers to make choices. Why is it so important […]

Factorization Machines for Recommendation Systems

As a data scientist who works on Feed Personalization, I find it important to stay up to date with the current state of Machine Learning and its applications. Most of the time, using some of the better-known recommendation algorithms yields good initial results; however, sometimes a change in the model is essential to provide customers […]

How to visualize decision trees in Python

The decision tree classifier is one of the most popular supervised learning algorithms. Unlike many other classification algorithms, a decision tree classifier is not a black box in the modeling phase. That means we can visualize the trained decision tree to understand how it will work for the given input features. So in this article, you […]

Integrating Pandas, Django REST Framework and Bokeh

It’s no secret that we love Django REST Framework. We’ve written quite a few blog posts about it and it is our default framework for projects that require a web API. Another package that we use a lot is Pandas (and NumPy by extension). It is fast, flexible, well documented and it has a very […]

Ad Hoc Distributed Random Forests #4

When arrays and dataframes aren’t flexible enough. TL;DR: Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with scikit-learn to train and run a distributed random forest on distributed tabular NYC Taxi data. Our machine learning model does not perform well, but we do learn how to execute ad-hoc […]

Pandas on HDFS with Dask Dataframes #2

In this post we use Pandas in parallel across an HDFS cluster to read CSV data. We coordinate these computations with dask.dataframe. A screencast version of this blog post is available here, and the previous post in this series is available here. This work was originally at matthewrocklin.com and is supported by Continuum Analytics and the XDATA […]

Ad Hoc Distributed Random Forests

When arrays and dataframes aren’t flexible enough. TL;DR: Dask.distributed lets you submit individual tasks to the cluster. We use this ability combined with scikit-learn to train and run a distributed random forest on distributed tabular NYC Taxi data. Our machine learning model does not perform well, but we do learn how to execute ad-hoc computations easily. Motivation […]

Twitter Pandas

Thanks to some great help from contributors, we’ve just pushed the first release of twitter pandas, v0.0.1. The first release is aimed at replicating the data-providing functions (no create/update/delete) from the tweepy API with the git-pandas style pandas interface. To install twitterpandas, just use pip (pip install twitterpandas), and then you can use it right […]