Editor’s note: Oliver is a speaker for ODSC West 2021. Be sure to check out his talk, “Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0,” there!
Reinforcement learning has a special place in the world of machine learning. Unlike other forms of machine learning, such as supervised or unsupervised learning, reinforcement learning does not need any existing data. Instead, it generates that data by running experiments in a predefined environment. Experiments are guided by an objective that can be given externally as a reward, or can be internal, like “explore” or “do not get bored.”
This is illustrated in figure 1. An agent performs actions in a given environment. Which action to take is decided by a policy that takes an observation of the environment as its input. Based on the rewards, which are mostly given by the environment, the policy might change over time. This is best illustrated by a game: your gaming agent might be a hedgehog or a plumber, and it might be put into a maze-like playing environment. What is displayed on your screen is the current observation, and the change in score as a result of an action might be the reward. Each move or jump of your agent is an action.
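To make the roles of agent, environment, policy, observation, action, and reward concrete, here is a minimal sketch in plain Python. The `Corridor` environment, the `random_policy` function, and the reward values are all hypothetical and chosen only for illustration; they are not taken from any library.

```python
import random


class Corridor:
    """Hypothetical toy environment: a corridor with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is simply the agent's position

    def step(self, action):
        # action 0 moves left, action 1 moves right; walls stop the agent
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # small cost per move, bonus at the goal
        return self.position, reward, done


def random_policy(observation, rng):
    """A policy maps an observation to an action; this one ignores the observation."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=100):
    """Play one episode and return the trajectory of (observation, action, reward)."""
    trajectory = []
    observation = env.reset()
    for _ in range(max_steps):
        action = policy(observation, rng)
        next_observation, reward, done = env.step(action)
        trajectory.append((observation, action, reward))
        observation = next_observation
        if done:
            break
    return trajectory


rng = random.Random(0)
trajectory = run_episode(Corridor(), random_policy, rng)
total_reward = sum(reward for _, _, reward in trajectory)
print(f"steps: {len(trajectory)}, total reward: {total_reward:.1f}")
```

Swapping `random_policy` for any other function with the same signature changes the agent's behavior without touching the environment, which is the separation that figure 1 describes.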
There are a number of different learning algorithms for reinforcement learning, and even different categories that form a taxonomy (https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html#a-taxonomy-of-rl-algorithms). However, in settings where experiments are safe and cheap to perform (that is, in most settings where the experiments are carried out in a simulated environment), the policy optimization algorithm PPO (Proximal Policy Optimization) is often the learning algorithm of choice.
As shown in figure 2, PPO works in two phases. In phase one, an agent gathers data by playing a sequence of actions in the given environment. This is repeated for a given number of episodes. The agent usually starts with a random policy, much like a child will begin a game just to find out how it works. In phase two, this policy is refined based on the trajectories the agent has taken. Loosely speaking, the refined policy will do more of the actions that led to higher rewards.
Figure 2: PPO works in two phases
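The alternation between gathering data and refining the policy can be sketched as a simple loop. The example below is a deliberate simplification, not PPO itself: it uses a hypothetical one-step environment with two actions, a softmax policy over two numbers instead of a neural network, and a basic policy-gradient update with the batch mean reward as a baseline. All function names and constants are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def play_episode(logits, rng):
    """Hypothetical one-step environment: action 1 pays off more often than action 0."""
    probs = softmax(logits)
    action = 0 if rng.random() < probs[0] else 1
    win_chance = 0.8 if action == 1 else 0.2
    reward = 1.0 if rng.random() < win_chance else 0.0
    return action, reward


def collect(logits, rng, episodes):
    """Phase one: gather data by playing with the current policy."""
    return [play_episode(logits, rng) for _ in range(episodes)]


def improve(logits, batch, learning_rate=0.5):
    """Phase two: shift probability towards actions with above-average reward."""
    probs = softmax(logits)
    baseline = sum(reward for _, reward in batch) / len(batch)
    new_logits = list(logits)
    for action, reward in batch:
        advantage = reward - baseline
        for a in range(len(logits)):
            # gradient of log pi(action) with respect to logit a
            grad = (1.0 if a == action else 0.0) - probs[a]
            new_logits[a] += learning_rate * advantage * grad / len(batch)
    return new_logits


rng = random.Random(0)
logits = [0.0, 0.0]  # equal preferences: the agent starts with a random policy
for iteration in range(50):
    batch = collect(logits, rng, episodes=64)
    logits = improve(logits, batch)

print("action probabilities after training:", softmax(logits))
```

After training, the policy should clearly prefer action 1, the one with the higher expected reward. What this sketch leaves out is any limit on how far a single update may move the policy.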
Advanced policy optimization algorithms, not only PPO, make sure that in the second phase the policy does not get traumatized by strange experiences that would, in effect, send it spiraling into bad moves and prevent a good learning outcome. This is done by limiting how much the policy can be updated in each training epoch. Such hard constraints are, however, difficult to implement and do not fit well into the world of standard supervised machine learning. PPO's magic lies in how it translates that hard constraint into penalties that are merely additional losses for standard deep learning components.
In figure 3 you can see the two neural networks PPO uses to implement its agent's behavior and learning. The policy network takes the observation as an input and decides which action to take. This is the only network you will need in production once training is done. The second network, commonly called the value network, stabilizes training and makes the resulting policy more general. In addition to the penalty mentioned above, which is expressed as a loss, each network contributes its own term to the overall loss, which can be minimized by standard backpropagation-based training.
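One way to see how a limit on policy updates can be expressed as an ordinary loss is the clipped surrogate objective from the PPO paper. The text above does not say which PPO variant it has in mind (the paper also describes a KL-penalty variant), so treat the choice of the clipped form as an assumption. In this plain-Python sketch, `ratio` stands for the probability of an action under the new policy divided by its probability under the old policy, `advantage` says how much better than expected the action turned out, and `epsilon=0.2` is a typical but arbitrary setting.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def policy_loss(ratios, advantages, epsilon=0.2):
    """Negative mean objective, so that minimizing the loss maximizes the objective."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return -sum(terms) / len(terms)


# A good action (positive advantage) that the new policy already favors much more:
# the objective is capped, so there is no incentive to move even further.
print(clipped_surrogate(ratio=1.5, advantage=1.0))

# A bad action (negative advantage) that the new policy already avoids much more:
# again capped, so the update does not overreact to a single bad experience.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))

# Moving in the wrong direction is not capped and is penalized in full.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

Because the result is just a number computed from the network outputs, it can be added to the other loss terms and minimized with the same backpropagation-based training as any other deep learning loss.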
If you want to know more about reinforcement learning with PPO, join the half-day hands-on training at ODSC West 2021. Building on what you learned here, there will be a deep dive explaining the different losses and tuning options using the TF-Agents implementation of PPO and TensorFlow 2. The workshop addresses all levels of experience. Depending on your existing experience, you will either just grasp the idea of PPO and reinforcement learning, run more advanced experiments, or even dig into deeper concepts.
About the author/ODSC West 2021 speaker on Reinforcement Learning with PPO:
Oliver Zeigermann is a software developer from Hamburg, Germany, and has been a practitioner for more than three decades. He specializes in frontend development and machine learning. He is the author of many video courses and textbooks.