This is a presentation given for Data Science DC on Tuesday November 14, 2017.

Further resources up front:


Hi! I’m Aaron. All these slides, and a corresponding write-up (you’re reading it) are on my blog.

I work at Deep Learning Analytics (DLA). We do deep learning work for government and commercial customers. DLA is a great place to work, and one of the ways it’s great is that it sends us to conferences and such things.

DLA sent me to the first UC Berkeley Deep RL Bootcamp organized by Abbeel, Duan, Chen, and Karpathy. It was a great experience and it largely inspired this talk.

I have a separate summary write-up about my experience at the bootcamp, and they’ve since put up all the videos, slides, and labs, so you can see everything that was covered there.

The following goes through all the content of the talk:



The other major source for this talk is Sutton and Barto’s textbook, which I like a lot.

The picture shows the first edition, which is not what you want. The second edition is available free online, and was last updated about a week ago (November 5, 2017).

Sutton and Barto are major figures in reinforcement learning, and they do not follow any “no original research” rules, making their book really fairly exciting, if you’re not put off by the length (over 400 pages).

(The diagrams on the cover are not neural nets, but backup diagrams.)



  • applications: what
  • theory
  • applications: how
  • onward


The plan for today is to first mention four successful applications of reinforcement learning. Then we’ll go through a core of theory. This will let us then understand pretty completely how each of those applications is achieved. Finally, we’ll wrap up, looking at a few other applications and thoughts about how things are going.

applications: what

The applications here are all going to be games, not because reinforcement learning is only applicable to games, but because games are fun, and these examples are well known and cover a good range of techniques.

First up, backgammon.

Next, Atari. A lot of Atari games are well played by RL now. The ones shown (Video Pinball, Boxing, Breakout) are some of the ones that RL does the best on.

I’m also including Tetris, mostly because it’s a chance to talk about an interesting technique.

And in the last two years, Go has been pretty much conquered by RL, so we’ll talk about that.


theory

Let’s start to build up the theory of reinforcement learning.

This is going to start very gradually, but I promise that by the end we’ll be moving fast.

Yann LeCun’s cake

    • cake: unsupervised learning
    • icing: supervised learning
    • cherry: reinforcement learning


Yann LeCun introduced this cake idea for relating three main varieties of machine learning. It’s largely based on one view of how much information is used at each training step.

I’m going to use it to build up and relate these three kinds of learning, while introducing reinforcement learning notation.

unsupervised learning

      • $s$

In unsupervised learning, we have a collection of states, where each individual state can be referred to with $s$.

I’m using “state” without distinguishing “state” from “observation”. You could also call these “examples” or “covariates” or “data points” or whatever you like.

The symbol “x” is commonly used.

state $s$

      • numbers
      • text (as numbers)
      • image (as numbers)
      • sound (as numbers)

A state can be anything, as long as it can be expressed numerically. So that includes text, images, and sound. Really anything.
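As a small illustration of "anything, as numbers", here is a minimal sketch of turning text and an image into a numeric state. The function names and the encodings (Unicode code points, flattened pixel rows) are my own assumptions for the example, not something from the talk; real systems usually use richer encodings.

```python
# Illustrative only: two hypothetical ways to turn raw data into a numeric state s.

def text_to_state(text):
    """Encode text as a list of Unicode code points."""
    return [ord(ch) for ch in text]


def image_to_state(rows):
    """Flatten a grayscale image (rows of 0-255 pixel values) into one list."""
    return [pixel for row in rows for pixel in row]


s_text = text_to_state("cat")
s_image = image_to_state([[0, 255], [128, 64]])

print(s_text)   # [99, 97, 116]
print(s_image)  # [0, 255, 128, 64]
```

Either list could serve as an $s$ in everything that follows.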

unsupervised (?) learning

      • given $s$
      • learn $s \to \text{cluster\_id}$
      • learn $s \to s$

So say we have a set of a thousand images. Each image is an $s$.

We want to learn something, and that tends to mean unsupervised learning starts to resemble supervised learning.

At two ends of a spectrum we have clustering and autoencoders, and all kinds of dimensionality reduction in between.
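To make the clustering end of that spectrum concrete, here is a minimal sketch of learning $s \to \text{cluster\_id}$. It assumes scalar states and uses plain k-means with hand-picked starting centers; the algorithm choice, the function name `kmeans_1d`, and the data are all assumptions for illustration, since the talk does not name a specific clustering method.

```python
def kmeans_1d(states, centers, n_iter=10):
    """Cluster scalar states around the given initial centers (plain k-means)."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each state goes to its nearest center.
        cluster_ids = [
            min(range(len(centers)), key=lambda k: abs(s - centers[k]))
            for s in states
        ]
        # Update step: each center moves to the mean of its assigned states.
        for k in range(len(centers)):
            members = [s for s, c in zip(states, cluster_ids) if c == k]
            if members:
                centers[k] = sum(members) / len(members)
    return cluster_ids, centers


# Made-up data: two obvious groups of scalar states.
states = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
cluster_ids, centers = kmeans_1d(states, centers=[0.0, 5.0])

print(cluster_ids)  # [0, 0, 0, 1, 1, 1]
```

Note that no labels are given anywhere: the cluster ids come only from the states themselves.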

Unsupervised learning is sort of the dark matter of machine learning. Even Yann LeCun says “We are missing the principles for unsupervised learning.”

deep unsupervised learning

      • $s$ with deep neural nets

Deep unsupervised learning is whenever we do unsupervised learning and somewhere there’s a deep neural net.

supervised learning

      • $s \to a$

Up next is supervised learning. We’re introducing a new entity $a$, which I’ll call an “action”. It’s common to call it a “label” or a “target” and to use the symbol “y”. Same thing.

action $a$

      • numbers
      • “cat”/“dog”
      • “left”/“right”
      • 17.3
      • [2.0, 11.7, 5]
      • 4.2V

Whatever you call it, the action is again a numeric thing. It could be anything that $s$ could be, but it tends to be lower-dimensional.

The cat/dog classifier is a popular example, and a left/right classifier is just the same, but those might feel more like actions.

supervised learning

      • given $s, a$
      • learn $s \to a$

In supervised learning you have a training set of state-action pairs, and you try to learn a function to produce the correct action based on the state alone.

Supervised learning can blur into imitation learning, which can be taken as a kind of reinforcement learning. For example, NVIDIA’s end-to-end self-driving car is based on learning human driving behaviors. (Sergey Levine explains in some depth.) But I’m not going to talk more about imitation learning, and supervised learning will stand alone.

You can learn this function with linear regression or support vector machines or whatever you like.
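For instance, a minimal sketch of the linear regression option, assuming scalar states and scalar actions, might look like the following. The helper names `fit_line` and `predict` and the training pairs are made up for illustration.

```python
def fit_line(states, actions):
    """Ordinary least squares for a = w * s + b with scalar s and a."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


# Made-up training set of (s, a) pairs, generated by a = 2 * s + 1.
states = [0.0, 1.0, 2.0, 3.0]
actions = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(states, actions)


def predict(s):
    """The learned function s -> a."""
    return w * s + b


print(predict(4.0))  # 9.0
```

The training data supplies the correct action for every state; once fitted, `predict` produces an action from a state alone.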

deep supervised learning

      • $s \to a$ with deep neural nets

Deep supervised learning is whenever we do supervised learning and somewhere there’s a deep neural net.

There’s also semi-supervised learning, when you have some labeled data and some unlabeled data, which can connect to active learning, which has some relation to reinforcement learning, but that’s all I’ll say about that.

reinforcement learning

      • $r, s \to a$

Finally, we reach reinforcement learning.

We’re adding a new thing $r$, which is reward.

reward $r$

      • -3
      • 0
      • 7.4
      • 1

Reward is a scalar, and we like positive rewards.

optimal control / reinforcement learning

I’ll mention that optimal control is closely related to reinforcement learning. It has its own parallel notation and conventions, and I’m going to ignore all that.


So here’s the reinforcement learning setting.

We get a reward and a state, and the agent chooses an action.


Then, time passes. We’re using discrete time, so this is a “tick”.


Then we get a new reward and state, which depend on the previous state and action, and the agent chooses a new action.


And so on.


This is the standard reinforcement learning diagram, showing the agent and environment. My notation is similar.
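The loop described over the last few slides can be sketched in code. This is a toy under stated assumptions: the environment (a walk along positions 0 to 4 with a reward of 1 on reaching position 4), the names `environment_step` and `agent`, and the fixed "always step right" policy are all hypothetical, and nothing is learned here. It only shows the order of events.

```python
def environment_step(state, action):
    """Hypothetical dynamics: walk on positions 0..4, reward 1 on reaching 4."""
    new_state = max(0, min(4, state + action))
    reward = 1 if new_state == 4 else 0
    return reward, new_state


def agent(reward, state):
    """A fixed, non-learning policy (r, s -> a): always step right."""
    return +1


reward, state = 0, 0  # initial reward and state
history = []
for tick in range(4):
    # The agent sees the current reward and state and chooses an action.
    action = agent(reward, state)
    history.append((reward, state, action))
    # Time passes: the new reward and state depend on the previous state and action.
    reward, state = environment_step(state, action)

print(history)        # [(0, 0, 1), (0, 1, 1), (0, 2, 1), (0, 3, 1)]
print(reward, state)  # 1 4
```

A learning agent would replace the fixed policy with one that changes in response to the rewards it receives.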

reinforcement learning

      • “given” 
Aaron Schumacher

Aaron Schumacher is a data scientist and software engineer for Deep Learning Analytics. He has taught with Python and R for General Assembly and the Metis data science bootcamp. Aaron has also worked with data at Booz Allen Hamilton, New York University, and the New York City Department of Education. He studied mathematics at the University of Wisconsin–Madison and teaching mathematics at Bard College. Aaron's career-best breakdancing result was advancing to the semi-finals of the R16 Korea 2009 individual footwork battle.