Individuals interested in reinforcement learning crowded into a room at ODSC Europe 2018. There, Badoo’s lead data scientist Leonardo De Marchi hosted a four-hour workshop to guide attendees through the first steps.
What is reinforcement learning?
Reinforcement learning is one of several machine learning approaches. Most people know of supervised and unsupervised learning. In supervised learning we have labeled data, so we know the results already. We build models that learn from the input data and try to predict answers close to the labeled results.
In unsupervised learning, we don’t have labeled results. Instead, we try to get insights from patterns in the datasets.
In the lesser-known reinforcement learning, we train an agent that lives in a world (the environment). The agent can perform certain actions, and in turn, it receives responses from the environment. Based on those responses, it learns how to behave in the environment to maximize its reward. That is how we teach an agent to succeed in a task.
In simple terms, reinforcement learning trains an agent to maximize its future rewards through the actions it takes in its environment.
At the beginning of its training, an agent has zero knowledge about the environment, the actions it can perform, and the types of rewards it can get. The only way it will learn is through exploration and exploitation.
De Marchi explained that RL algorithms require a balance between the two. Exploration means trying actions to gather information and observations about the environment. Exploitation means choosing the actions that the agent, based on what it has learned so far, expects to yield the highest reward.
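One common way to strike this balance is an epsilon-greedy rule. The sketch below is our own illustration, not code from the workshop: the function name, the reward estimates, and the epsilon values are all assumptions chosen for the example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an action index from a list of estimated rewards (hypothetical helper)."""
    if rng.random() < epsilon:
        # Explore: try any action to gather more information.
        return rng.randrange(len(estimates))
    # Exploit: take the action with the highest current estimate.
    return max(range(len(estimates)), key=lambda a: estimates[a])


estimates = [0.2, 0.9, 0.5]                     # made-up reward estimates
print(epsilon_greedy(estimates, epsilon=0.0))   # never explores, so it picks index 1
print(epsilon_greedy(estimates, epsilon=1.0))   # always explores: any of 0, 1, 2
```

With epsilon set to 0 the agent only exploits and may never discover a better action. With epsilon set to 1 it only explores and never uses what it has learned. Values in between trade one off against the other.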
Workshop participants used OpenAI Gym, a toolkit for developing and testing agents in different environments. The name reflects the idea: RL algorithms train and are tested there much as people train in regular gyms.
The workshop started with examples of successful reinforcement learning applications, such as AlphaGo, the first computer program to defeat a professional human player, and later a world champion, at the board game Go.
We also tackled the multi-armed bandit problem. A gambler faces a set of betting machines at a casino, each with an unknown payout, and must decide which machines to play to maximize the total reward while limiting the cost of trying poor machines. Multi-armed bandit problems are stateless: we don’t have to keep track of the environment state because the reward depends only on the action chosen.
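A minimal sketch of the bandit setting, assuming an epsilon-greedy gambler and machines that pay 1 with a fixed probability. The function name, the win probabilities, and the parameter values are our own assumptions and were not taken from the workshop material.

```python
import random


def run_bandit(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a simulated multi-armed bandit and return what the gambler learned."""
    rng = random.Random(seed)
    n_arms = len(win_probs)
    pulls = [0] * n_arms        # how often each machine was played
    estimates = [0.0] * n_arms  # running average reward per machine
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore a random machine
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        # No state to track: the reward depends only on the chosen arm.
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental mean: move the estimate toward the observed reward.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        total += reward
    return estimates, pulls, total


estimates, pulls, total = run_bandit([0.2, 0.5, 0.8])
print("estimated payouts:", [round(e, 2) for e in estimates])
print("pulls per machine:", pulls)
```

The only memory the gambler keeps is one running average per machine, which is what makes the problem stateless.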
The approach to training an agent depends on the type of environment. In any case, we start with only an environment and the agent, which only knows the state of the environment at each step. Through an algorithm, the agent takes actions that will impact the environment. Then a reward function gives the agent feedback.
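The loop described above can be sketched in a few lines. The CorridorEnv class below is a hypothetical toy environment written for this example. Its reset and step methods are only loosely modeled on the Gym style and are not the actual OpenAI Gym API.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D world: start at cell 0, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (state, reward, done)."""
        move = 1 if action == 1 else -1
        # Walls: the agent cannot leave the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # reward function: feedback only at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent sees only the current state
        state, reward, done = env.step(action)  # the action changes the environment
        total_reward += reward                  # the reward function gives feedback
        if done:
            break
    return total_reward


rng = random.Random(0)
print(run_episode(CorridorEnv(), lambda state: rng.randrange(2)))  # random agent
print(run_episode(CorridorEnv(), lambda state: 1))                 # always move right
```

A learning algorithm would replace the fixed policy functions here and use the rewards to improve its choices over many episodes.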
An important and often difficult part of RL is defining a good reward function that allows the agent to learn from the environment and its actions. Because the agent starts without any knowledge of the environment, the rewards it receives for its actions are the only signal it can learn from. That is why the learning curve starts very low, and rises as the agent receives better rewards for its actions and identifies useful information about the environment.
Later in the workshop, we covered Markov decision processes, where we added states to the world and tasks could be continuous. As an example, we tried to create an agent to solve the frozen lake exercise. We implemented the State-Action-Reward-State-Action (SARSA) algorithm, an on-policy RL method that learns the value of each action in each state from the agent’s own experience and uses those values to decide how to act.
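For readers who want to see the shape of the algorithm, here is a minimal tabular SARSA sketch. It is not the workshop code: the 4x4 grid is a hand-written, deterministic stand-in in the spirit of the frozen lake exercise rather than the OpenAI Gym environment, and the hyperparameters (alpha, gamma, epsilon, episode count) are assumptions picked for illustration.

```python
import random

# Hypothetical 4x4 map: S start, F frozen (safe), H hole, G goal.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
SIZE = 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    # Moves that would leave the grid keep the agent at the edge.
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    tile = GRID[row][col]
    return row * SIZE + col, (1.0 if tile == "G" else 0.0), tile in "HG"


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def sarsa_update(q, s, a, r, s2, a2, done, alpha, gamma):
    """Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    target = r if done else r + gamma * q[s2][a2]
    q[s][a] += alpha * (target - q[s][a])


def train(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(SIZE * SIZE)]  # one row of action values per state
    for _ in range(episodes):
        s = 0
        a = choose_action(q[s], epsilon, rng)
        for _ in range(100):  # step limit per episode
            s2, r, done = step(s, a)
            # On-policy: the next action is picked by the same policy
            # and is the one used in the update.
            a2 = choose_action(q[s2], epsilon, rng)
            sarsa_update(q, s, a, r, s2, a2, done, alpha, gamma)
            s, a = s2, a2
            if done:
                break
    return q


q_table = train()
print([round(v, 2) for v in q_table[14]])  # action values next to the goal
```

The name of the algorithm comes from the five values used in each update: the current state and action, the reward, and the next state and action.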
The workshop covered the core concepts of reinforcement learning through hands-on examples, which made it easy to start applying them in new scenarios.
Reinforcement learning is a great new skill to add to your toolbox as a data scientist. Don’t worry if you don’t know when you would apply RL — you will find plenty of examples where it is a suitable selection.