Principal Component Analysis Tutorial

The Problem. Imagine that you are a nutritionist trying to explore the nutritional content of food. What is the best way to differentiate food items? By vitamin content? Protein levels? Or perhaps a combination of both? Knowing the variables that best differentiate your items has several uses: 1. Visualization. Using the right variables to plot […]

Implementing a Principal Component Analysis (PCA) in Python, step by step

Sections: Introduction; Principal Component Analysis (PCA) vs. Multiple Discriminant Analysis (MDA); What is a “good” subspace?; Summarizing the PCA approach; Generating some 3-dimensional sample data; Why are we choosing a 3-dimensional sample?; 1. Taking the whole dataset ignoring the class labels; 2. Computing the d-dimensional mean vector; 3. a) Computing the Scatter Matrix; 3. […]

Installing Jupyter with the PySpark and R kernels for Spark development

This is a quick tutorial on installing Jupyter and setting up the PySpark and R (IRkernel) kernels for Spark development. The prerequisite for following this tutorial is a deployed Hadoop/Spark cluster with the relevant services up and running (e.g. HDFS, YARN, Hive, Spark). In this tutorial I am using IBM’s Hadoop […]

Scikit-learn Tutorial: Statistical-Learning for Scientific Data Processing

Zip file for off-line browsing: https://github.com/GaelVaroquaux/scikit-learn-tutorial/zipball/gh-pages Statistical learning: Machine learning is a technique of growing importance, as the size of the datasets that experimental sciences face is rapidly growing. The problems it tackles range from building a prediction function linking different observations, to classifying observations, to learning the structure in an unlabeled dataset. This tutorial […]

1. Statistical Learning: The Setting and the Estimator Object in Scikit-learn

1.1. Datasets. Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis. A simple example shipped with the […]
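
To make the samples/features convention in this excerpt concrete, here is a minimal sketch using NumPy. The array values and variable names are made up for illustration and are not taken from the tutorial itself; the image stack is likewise a hypothetical stand-in for data that does not start out as 2D.

```python
import numpy as np

# Hypothetical toy data: 4 observations (rows), 3 features (columns).
data = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])

# Axis 0 is the samples axis, axis 1 is the features axis.
n_samples, n_features = data.shape
print(n_samples, n_features)

# Data that is not already 2D (here an assumed stack of ten 8x8 images)
# can be reshaped so that each observation becomes one row of features.
images = np.zeros((10, 8, 8))
flat = images.reshape(images.shape[0], -1)
print(flat.shape)
```

Keeping one observation per row means the same array layout can be handed to any estimator that expects a (samples, features) input.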

Recurrent Neural Networks Tutorial, Part 3 – Backpropagation through time and vanishing gradients

In the previous part of the tutorial we implemented an RNN from scratch, but didn’t go into detail on how the Backpropagation Through Time (BPTT) algorithm calculates the gradients. In this part we’ll give a brief overview of BPTT and explain how it differs from traditional backpropagation. We will then try to understand the vanishing gradient […]

Intro to Recurrent Neural Networks #2 – Implement an RNN

This is the second part of the Recurrent Neural Network Tutorial. The first part is here. Code to follow along is on GitHub. In this part we will implement a full Recurrent Neural Network from scratch using Python and optimize our implementation using Theano, a library to perform operations on a GPU. The full code is […]

Intro to Recurrent Neural Networks, #1

Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks. But despite their recent popularity, I’ve only found a limited number of resources that thoroughly explain how RNNs work and how to implement them. That’s what this tutorial is about. It’s a multi-part series in which I’m planning to […]

Improved vtreat Documentation

Nina Zumel has donated some time to greatly improve the vtreat R package documentation (now available as pre-rendered HTML here). vtreat is an R data.frame processor/conditioner package that helps prepare real-world data for predictive modeling in a statistically sound manner. Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, […]