Installing Jupyter with the PySpark and R kernels for Spark development

This is a quick tutorial on installing Jupyter and setting up the PySpark and R (IRkernel) kernels for Spark development. The prerequisite for following this tutorial is a deployed Hadoop/Spark cluster with the relevant services up and running (e.g. HDFS, YARN, Hive, Spark). In this tutorial I am using IBM’s Hadoop […]

Scikit-learn Tutorial: Statistical-Learning for Scientific Data Processing

Zip file for off-line browsing: https://github.com/GaelVaroquaux/scikit-learn-tutorial/zipball/gh-pages Statistical learning: Machine learning is a technique of growing importance, as the datasets that experimental sciences face are rapidly growing in size. The problems it tackles range from building a prediction function linking different observations, to classifying observations, to learning the structure in an unlabeled dataset. This tutorial […]

1. Statistical Learning: The Setting and the Estimator Object in Scikit-learn

1.1. Datasets: Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis. A simple example shipped with the […]
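
The samples-axis and features-axis convention described in this excerpt can be sketched with a small NumPy array. This is only an illustration: the values below are made up, the variable names are hypothetical, and it is not the dataset the original tutorial goes on to use.

```python
import numpy as np

# Hypothetical toy data: 4 observations (samples), each with 3 measurements (features).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])

# First axis is the samples axis, second axis is the features axis.
n_samples, n_features = X.shape

first_observation = X[0]   # one row: a single multi-dimensional observation
first_feature = X[:, 0]    # one column: a single feature across all samples
```

Indexing along the first axis picks out an observation, while slicing along the second axis picks out a feature.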

Recurrent Neural Networks Tutorial, Part 3 – Backpropagation through time and vanishing gradients

In the previous part of the tutorial we implemented an RNN from scratch, but didn’t go into detail on how the Backpropagation Through Time (BPTT) algorithm calculates the gradients. In this part we’ll give a brief overview of BPTT and explain how it differs from traditional backpropagation. We will then try to understand the vanishing gradient […]

Intro to Recurrent Neural Networks #2 – Implement an RNN

This is the second part of the Recurrent Neural Network Tutorial. The first part is here. Code to follow along is on Github. In this part we will implement a full Recurrent Neural Network from scratch using Python and optimize our implementation using Theano, a library to perform operations on a GPU. The full code is […]

Intro to Recurrent Neural Networks, #1

Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks. But despite their recent popularity I’ve only found a limited number of resources that thoroughly explain how RNNs work and how to implement them. That’s what this tutorial is about. It’s a multi-part series in which I’m planning to […]

Improved vtreat Documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here). vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, […]

Intro to Data Science

Table of Contents 0.0 Setup 0.1 Python & Pip 0.2 R & R Studio 0.3 Other 1.0 Background 1.1 What is Data Science? 1.1.1 What do you mean by data? 1.2 Is data science the same as machine learning? 1.3 Why is Data Science important? 2.0 Data Science Process 2.1 What is a “Data Science” […]