Chapter 2: Machine Learning In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their... HACKER’S GUIDE TO NEURAL NETWORKS, #2

Chapter 2: Machine Learning

In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass). In this chapter we will see how useful this extremely simple mechanism is in Machine Learning.

This post was originally posted here on karpathy.github.io/

Binary Classification

As we did before, lets start out simple. The simplest, common and yet very practical problem in Machine Learning is binary classification. A lot of very interesting and important problems can be reduced to it. The setup is as follows: We are given a dataset of N vectors and every one of them is labeled with a +1 or a -1. For example, in two dimensions our dataset could look as simple as:

vector -> label
[1.2, 0.7] -> +1
[-0.3, 0.5] -> -1
[-3, -1] -> +1
[0.1, 1.0] -> -1
[3.0, 1.1] -> -1
[2.1, -3] -> +1

Here, we have N = 6 datapoints, where every datapoint has two features (D = 2). Three of the datapoints have label +1 and the other three label -1. This is a silly toy example, but in practice a +1/-1 dataset could be very useful things indeed: For example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.

Goal. Our goal in binary classification is to learn a function that takes a 2-dimensional vector and predicts the label. This function is usually parameterized by a certain set of parameters, and we will want to tune the parameters of the function so that its outputs are consistent with the labeling in the provided dataset. In the end we can discard the dataset and use the learned parameters to predict labels for previously unseen vectors.

Training protocol

We will eventually build up to entire neural networks and complex expressions, but lets start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1. The only difference is that we’ll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities). Anyway, lets use a simple linear function:


In this expression we think of x and y as the inputs (the 2D vectors) and a,b,c as the parameters of the function that we will want to learn. For example, if a = 1, b = -2, c = -1, then the function will take the first datapoint ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2. Here is how the training will work:

  1. We select a random datapoint and feed it through the circuit
  2. We will interpret the output of the circuit as a confidence that the datapoint has class +1. (i.e. very high values = circuit is very certain datapoint has class +1 and very low values = circuit is certain this datapoint has class -1.)
  3. We will measure how well the prediction aligns with the provided labels. Intuitively, for example, if a positive example scores very low, we will want to tug in the positive direction on the circuit, demanding that it should output higher value for this datapoint. Note that this is the case for the the first datapoint: it is labeled as +1 but our predictor unction only assigns it value -1.2. We will therefore tug on the circuit in positive direction; We want the value to be higher.
  4. The circuit will take the tug and backpropagate it to compute tugs on the inputs a,b,c,x,y
  5. Since we think of x,y as (fixed) datapoints, we will ignore the pull on x,y. If you’re a fan of my physical analogies, think of these inputs as pegs, fixed in the ground.
  6. On the other hand, we will take the parameters a,b,c and make them respond to their tug (i.e. we’ll perform what we call a parameter update). This, of course, will make it so that the circuit will output a slightly higher score on this particular datapoint in the future.
  7. Iterate! Go back to step 1.

The training scheme I described above, is commonly referred as Stochastic Gradient Descent. The interesting part I’d like to reiterate is that a,b,c,x,y are all made up of the same stuff as far as the circuit is concerned: They are inputs to the circuit and the circuit will tug on all of them in some direction. It doesn’t know the difference between parameters and datapoints. However, after the backward pass is complete we ignore all tugs on the datapoints (x,y) and keep swapping them in and out as we iterate over examples in the dataset. On the other hand, we keep the parameters (a,b,c) around and keep tugging on them every time we sample a datapoint. Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.

Learning a Support Vector Machine

As a concrete example, lets learn a Support Vector Machine. The SVM is a very popular linear classifier; Its functional form is exactly as I’ve described in previous section, f(x,y)=ax+by+cf(x,y)=ax+by+c. At this point, if you’ve seen an explanation of SVMs you’re probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc. But here, I’d like to take a different approach. Instead of definining loss functions, I would like to base the explanation on the force specification (I just made this term up by the way) of a Support Vector Machine, which I personally find much more intuitive. As we will see, talking about the force specification and the loss function are identical ways of seeing the same problem. Anyway, here it is:

Support Vector Machine “Force Specification”:

  • If we feed a positive datapoint through the SVM circuit and the output value is less than 1, pull on the circuit with force +1. This is a positive example so we want the score to be higher for it.
  • Conversely, if we feed a negative datapoint through the SVM and the output is greater than -1, then the circuit is giving this datapoint dangerously high score: Pull on the circuit downwards with force -1.
  • In addition to the pulls above, always add a small amount of pull on the parameters a,b (notice, not on c!) that pulls them towards zero. You can think of both a,b as being attached to a physical spring that is attached at zero. Just as with a physical spring, this will make the pull proprotional to the value of each of a,b (Hooke’s law in physics, anyone?). For example, if a becomes very high it will experience a strong pull of magnitude |a| back towards zero. This pull is something we call regularization, and it ensures that neither of our parameters a or b gets disproportionally large. This would be undesirable because both a,b get multiplied to the input features x,y (remember the equation is a*x + b*y + c), so if either of them is too high, our classifier would be overly sensitive to these features. This isn’t a nice property because features can often be noisy in practice, so we want our classifier to change relatively smoothly if they wiggle around.

Lets quickly go through a small but concrete example. Suppose we start out with a random parameter setting, say, a = 1, b = -2, c = -1. Then:

  • If we feed the point [1.2, 0.7], the SVM will compute score 1 * 1.2 + (-2) * 0.7 - 1 = -1.2. This point is labeled as +1 in the training data, so we want the score to be higher than 1. The gradient on top of the circuit will thus be positive: +1, which will backpropagate to a,b,c. Additionally, there will also be a regularization pull on a of -1 (to make it smaller) and regularization pull on b of +2 to make it larger, toward zero.
  • Suppose instead that we fed the datapoint [-0.3, 0.5] to the SVM. It computes 1 * (-0.3) + (-2) * 0.5 - 1 = -2.3. The label for this point is -1, and since -2.3 is smaller than -1, we see that according to our force specification the SVM should be happy: The computed score is very negative, consistent with the negative label of this example. There will be no pull at the end of the circuit (i.e it’s zero), since there no changes are necessary. However, there will still be the regularization pull on a of -1 and on b of +2.

Okay there’s been too much text. Lets write the SVM code and take advantage of the circuit machinery we have from Chapter 1:

// A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit
// It can also compute the gradient w.r.t. its inputs
var Circuit = function() {
  // create some gates
  this.mulg0 = new multiplyGate();
  this.mulg1 = new multiplyGate();
  this.addg0 = new addGate();
  this.addg1 = new addGate();
Circuit.prototype = {
  forward: function(x,y,a,b,c) {
    this.ax = this.mulg0.forward(a, x); // a*x
    this.by = this.mulg1.forward(b, y); // b*y
    this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y
    this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c
    return this.axpbypc;
  backward: function(gradient_top) { // takes pull from above
    this.axpbypc.grad = gradient_top;
    this.addg1.backward(); // sets gradient in axpby and c
    this.addg0.backward(); // sets gradient in ax and by
    this.mulg1.backward(); // sets gradient in b and y
    this.mulg0.backward(); // sets gradient in a and x

That’s a circuit that simply computes a*x + b*y + c and can also compute the gradient. It uses the gates code we developed in Chapter 1. Now lets write the SVM, which doesn’t care about the actual circuit. It is only concerned with the values that come out of it, and it pulls on the circuit.

// SVM class
var SVM = function() {
  // random initial parameter values
  this.a = new Unit(1.0, 0.0); 
  this.b = new Unit(-2.0, 0.0);
  this.c = new Unit(-1.0, 0.0);

  this.circuit = new Circuit();
SVM.prototype = {
  forward: function(x, y) { // assume x and y are Units
    this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
    return this.unit_out;
  backward: function(label) { // label is +1 or -1

    // reset pulls on a,b,c
    this.a.grad = 0.0; 
    this.b.grad = 0.0; 
    this.c.grad = 0.0;

    // compute the pull based on what the circuit output was
    var pull = 0.0;
    if(label === 1 && this.unit_out.value < 1) { 
      pull = 1; // the score was too low: pull up
    if(label === -1 && this.unit_out.value > -1) {
      pull = -1; // the score was too high for a positive example, pull down
    this.circuit.backward(pull); // writes gradient into x,y,a,b,c
    // add regularization pull for parameters: towards zero and proportional to value
    this.a.grad += -this.a.value;
    this.b.grad += -this.b.value;
  learnFrom: function(x, y, label) {
    this.forward(x, y); // forward pass (set .value in all Units)
    this.backward(label); // backward pass (set .grad in all Units)
    this.parameterUpdate(); // parameters respond to tug
  parameterUpdate: function() {
    var step_size = 0.01;
    this.a.value += step_size * this.a.grad;
    this.b.value += step_size * this.b.grad;
    this.c.value += step_size * this.c.grad;

Now lets train the SVM with Stochastic Gradient Descent:

var data = []; var labels = [];
data.push([1.2, 0.7]); labels.push(1);
data.push([-0.3, -0.5]); labels.push(-1);
data.push([3.0, 0.1]); labels.push(1);
data.push([-0.1, -1.0]); labels.push(-1);
data.push([-1.0, 1.1]); labels.push(-1);
data.push([2.1, -3]); labels.push(1);
var svm = new SVM();

// a function that computes the classification accuracy
var evalTrainingAccuracy = function() {
var num_correct = 0;

Andrej Karapathy

As of Summer 2016 I am a Research Scientist at OpenAI working on Deep Learning, Generative Models and Reinforcement Learning. Previously I was a Computer Science PhD student at Stanford, working with Fei-Fei Li. My research centered around Deep Learning and its applications in Computer Vision, Natural Language Processing and their intersection. In particular, I was interested in fully end-to-end learning with Convolutional/Recurrent Neural Networks architectures and recent advances in Deep Reinforcement Learning. Over the course of my PhD I squeezed in two internships at Google where I worked on large-scale feature learning over YouTube videos, and last summer I interned at DeepMind and worked on Deep Reinforcement Learning and Generative Models. Together with Fei-Fei, I designed and taught a new Stanford undergraduate-level class on Convolutional Neural Networks for Visual Recognition (CS231n). The class was the first Deep Learning course offering at Stanford and has grown from 150 enrolled students last year to 330 students this year. On a side for fun I blog, tweet, and maintain several Deep Learning libraries written in Javascript (e.g. ConvNetJS, RecurrentJS, REINFORCEjs, t-sneJS). I am also sometimes jokingly referred to as the reference human for ImageNet (post :)), and I create those nice-looking conference proceedings LDA visualization pages each year (NIPS 2015 example). I also recently expanded on this with arxiv-sanity.com, which lets you search and sort through 20,000+ Arxiv papers on Machine Learning over the last 3 years in the same pretty format.