Further resources up front:
- A Brief Survey of Deep Reinforcement Learning (paper)
- Karpathy’s Pong from Pixels (blog post)
- Reinforcement Learning: An Introduction (textbook)
- David Silver’s course (videos and slides)
- Deep Reinforcement Learning Bootcamp (videos, slides, and labs)
- OpenAI gym / baselines (software)
- National Go Center (physical place)
- Hack and Tell (fun meetup)
Hi! I’m Aaron. All these slides, and a corresponding write-up (you’re reading it), are on my blog.
I work at Deep Learning Analytics (DLA). We do deep learning work for government and commercial customers. DLA is a great place to work, and one of the ways it’s great is that it sends us to conferences and such things.
DLA sent me to the first UC Berkeley Deep RL Bootcamp organized by Abbeel, Duan, Chen, and Karpathy. It was a great experience and it largely inspired this talk.
The following goes through all the content of the talk:
The other major source for this talk is Sutton and Barto’s textbook, which I like a lot.
Sutton and Barto are major figures in reinforcement learning, and they do not follow any “no original research” rule, which makes their book fairly exciting, if you’re not put off by the length (over 400 pages).
(The diagrams on the cover are not neural nets, but backup diagrams.)
The plan for today is to first mention four successful applications of reinforcement learning. Then we’ll go through a core of theory. This will let us then understand pretty completely how each of those applications is achieved. Finally, we’ll wrap up, looking at a few other applications and thoughts about how things are going.
The applications here are all going to be games, not because reinforcement learning is only applicable to games, but because games are fun, and these examples are well known and cover a good range of techniques.
And in the last two years, Go has been pretty much conquered by RL, so we’ll talk about that.
Let’s start to build up the theory of reinforcement learning.
This is going to start very gradually, but I promise that by the end we’ll be moving fast.
Yann LeCun’s cake
- cake: unsupervised learning
- icing: supervised learning
- cherry: reinforcement learning
Yann LeCun introduced this cake idea for relating three main varieties of machine learning. It’s largely based on one view of how much information is used at each training step.
I’m going to use it to build up and relate these three kinds of learning, while introducing reinforcement learning notation.
In unsupervised learning, we have a collection of states, where each individual state can be referred to with s.
I’m using “state” without distinguishing “state” from “observation”. You could also call these “examples” or “covariates” or “data points” or whatever you like.
The symbol “x” is commonly used.
- text (as numbers)
- image (as numbers)
- sound (as numbers)
A state can be anything, as long as it can be expressed numerically. That includes text, images, and sound. Really anything.
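As a small illustration of “as numbers”, here is one way a piece of text and a tiny image could each become a numeric state. The encodings (Unicode code points, a made-up 2x2 grayscale grid) are just assumptions for the sketch; real systems usually use richer representations.

```python
# Text as numbers: one integer (Unicode code point) per character.
text = "cat"
text_state = [ord(ch) for ch in text]

# Image as numbers: a hypothetical 2x2 grayscale image,
# with pixel intensities from 0 (black) to 255 (white).
image_state = [[0, 255],
               [255, 0]]

# Flattening the rows gives a plain vector, a common form for s.
flat_image_state = [pixel for row in image_state for pixel in row]

print(text_state)        # [99, 97, 116]
print(flat_image_state)  # [0, 255, 255, 0]
```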
unsupervised (?) learning
- given s
- learn s→cluster_id
- learn s→s
So say we have a set of a thousand images. Each image is an s.
We want to learn something, and that tends to mean unsupervised learning starts to resemble supervised learning.
At two ends of a spectrum we have clustering and autoencoders, and all kinds of dimensionality reduction in between.
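To make the clustering end of that spectrum concrete, here is a minimal sketch of learning s→cluster_id with k-means on one-dimensional states. The data, the starting centers, and the function names are all made up for illustration; this is plain Python rather than any particular library.

```python
def assign_clusters(states, centers):
    """Map each state s to the index of its nearest center (s -> cluster_id)."""
    return [
        min(range(len(centers)), key=lambda k: (s - centers[k]) ** 2)
        for s in states
    ]


def k_means_1d(states, centers, n_iters=10):
    """Alternate between assigning states to centers and re-averaging centers."""
    centers = list(centers)
    for _ in range(n_iters):
        ids = assign_clusters(states, centers)
        for k in range(len(centers)):
            members = [s for s, i in zip(states, ids) if i == k]
            if members:  # leave a center alone if no state chose it
                centers[k] = sum(members) / len(members)
    return centers, assign_clusters(states, centers)


# Made-up states: three near 1 and three near 10. Note there are no labels.
states = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, cluster_ids = k_means_1d(states, centers=[0.0, 5.0])
print(cluster_ids)  # [0, 0, 0, 1, 1, 1]
```

Only s goes in, and what comes out is a function from s to a cluster id, which is why this starts to look like a supervised problem where the labels were invented along the way.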
deep unsupervised learning
- s with deep neural nets
Deep unsupervised learning is whenever we do unsupervised learning and somewhere there’s a deep neural net.
Up next is supervised learning. We’re introducing a new entity a, which I’ll call an “action”. It’s common to call it a “label” or a “target” and to use the symbol “y”. Same thing.
- [2.0, 11.7, 5]
Whatever you call it, the action is again a numeric thing. It could be anything that s could be, but it tends to be lower-dimensional.
The cat/dog classifier is a popular example. A left/right classifier works just the same, but “left” and “right” might feel more like actions.
- given s,a
- learn s→a
In supervised learning you have a training set of state-action pairs, and you try to learn a function to produce the correct action based on the state alone.
Supervised learning can blur into imitation learning, which can be taken as a kind of reinforcement learning. For example, NVIDIA’s end-to-end self-driving car is based on learning human driving behaviors. (Sergey Levine explains in some depth.) But I’m not going to talk more about imitation learning, and supervised learning will stand alone.
You can learn this function with linear regression or support vector machines or whatever you like.
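For instance, here is a minimal sketch of learning s→a with one-variable least-squares linear regression, in plain Python. The training pairs are invented (they follow a = 2s + 1 exactly), and `fit_line` and `predict` are illustrative names, not anything from the talk.

```python
def fit_line(states, actions):
    """Ordinary least squares for a = slope * s + intercept, with scalar s and a."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    slope = cov / var
    intercept = mean_a - slope * mean_s
    return slope, intercept


# Training set of (state, action) pairs; here the "actions" follow a = 2s + 1.
states = [0.0, 1.0, 2.0, 3.0]
actions = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(states, actions)


def predict(s):
    """The learned function s -> a."""
    return slope * s + intercept


print(predict(4.0))  # 9.0
```

The learned function only sees s at prediction time; the actions were needed only for training.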
deep supervised learning
- s→a with deep neural nets
Deep supervised learning is whenever we do supervised learning and somewhere there’s a deep neural net.
There’s also semi-supervised learning, when you have some labeled data and some unlabeled data, which can connect to active learning, which has some relation to reinforcement learning, but that’s all I’ll say about that.
Finally, we reach reinforcement learning.
We’re adding a new thing r, which is reward.
Reward is a scalar, and we like positive rewards.
optimal control / reinforcement learning
I’ll mention that optimal control is closely related to reinforcement learning. It has its own parallel notation and conventions, and I’m going to ignore all that.
So here’s the reinforcement learning setting.
We get a reward and a state, and the agent chooses an action.
Then, time passes. We’re using discrete time, so this is a “tick”.
Then we get a new reward and state, which depend on the previous state and action, and the agent chooses a new action.
And so on.
This is the standard reinforcement learning diagram, showing the agent and environment. My notation is similar.
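The loop just described (get a reward and state, choose an action, tick, repeat) can be sketched in a few lines of Python. Everything here is a made-up toy: `CorridorEnv`, its reward of 1.0 at the far end, and the two agents are assumptions for illustration, not a real library's API.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..length-1, reward at the far end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start an episode; return the initial reward and state."""
        self.position = 0
        return 0.0, self.position

    def step(self, action):
        """One tick: the new reward and state depend on the old state and the action."""
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return reward, self.position, done


def random_agent(reward, state, rng):
    """Ignores what it sees and moves left (-1) or right (+1) at random."""
    return rng.choice([-1, 1])


def always_right_agent(reward, state, rng):
    """Always moves right (+1)."""
    return 1


def run_episode(env, agent, rng, max_ticks=1000):
    """Run the agent-environment loop; return total reward and number of ticks."""
    reward, state = env.reset()
    total_reward = reward
    ticks = 0
    while ticks < max_ticks:
        action = agent(reward, state, rng)       # agent chooses an action
        reward, state, done = env.step(action)   # time passes: one tick
        total_reward += reward
        ticks += 1
        if done:
            break
    return total_reward, ticks


rng = random.Random(0)
print(run_episode(CorridorEnv(), always_right_agent, rng))  # (1.0, 4)
print(run_episode(CorridorEnv(), random_agent, rng))
```

Neither agent here learns anything. The sketch only shows the interface: rewards and states flow in, actions flow out, one tick at a time.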
Aaron Schumacher is a data scientist and software engineer for Deep Learning Analytics. He has taught with Python and R for General Assembly and the Metis data science bootcamp. Aaron has also worked with data at Booz Allen Hamilton, New York University, and the New York City Department of Education. He studied mathematics at the University of Wisconsin–Madison and teaching mathematics at Bard College. Aaron's career-best breakdancing result was advancing to the semi-finals of the R16 Korea 2009 individual footwork battle.