Imagine that you want to learn to ride a bike and ask a friend for advice. They explain how the gears work, how to release the brake and a few other technical details. In the end, you ask the secret to keeping your balance. What kind of answer do you expect? In an imaginary supervised world, you should be able to perfectly quantify your actions and correct errors by comparing the outcomes with precise reference values. In the real world, you have no idea about the quantities underlying your actions and, above all, you will never know what the right value is.
This article is an excerpt from the book Mastering Machine Learning Algorithms, Second Edition by Giuseppe Bonaccorso, a newly updated and revised edition of the bestselling guide to exploring and mastering the most important algorithms for solving complex machine learning problems. This article helps you understand the fundamental concepts of Reinforcement learning.
Increasing the level of abstraction, the scenario we’re considering can be described as: a generic agent performs actions inside an environment and receives feedback that is somehow proportional to the competence of its actions. According to this feedback, the agent can correct its actions in order to reach a specific goal. This basic schema is represented in the following diagram:
Returning to our initial example, when you ride a bike for the first time and try to keep your balance, you will notice that the wrong movement causes an increase in the slope, which in turn increases the horizontal component of the gravitational force, pushing the bike laterally. As the vertical component is compensated, the result is a rotation that ends when the bike falls to the ground completely. However, as you can use your legs to control your balance, when the bike starts falling, thanks to Newton’s third law, the force on your leg increases and your brain understands that it’s necessary to make a movement in the opposite direction.
Even though this problem can be easily expressed in terms of physical laws, nobody learns to ride a bike by computing forces and momentums. This is one of the main concepts of RL: an agent must always make its choices considering a piece of information, usually defined as a reward that represents the response provided by the environment. If the action is correct, the reward will be positive, otherwise, it will be negative. After receiving a reward, an agent can fine-tune the strategy, called the policy, in order to maximize the expected future reward.
For example, after a few rides, you will be able to slightly move your body so as to keep your balance while turning, but in the beginning, you’ll probably need to extend your leg to avoid falling down. Hence, your initial policy suggested an incorrect action, which received repeated negative rewards; so, your brain corrected it by increasing the probability of choosing another action. The implicit hypothesis that underlies this approach is that an agent is always rational, meaning that its goal is to maximize the expected return of its actions (nobody decides to fall off their bike just to find out how it feels).
Before discussing the single components of an RL system, it’s necessary to add a couple of fundamental assumptions. The first one is that an agent can repeat experiences an infinite number of times. In other words, we assume that it’s possible to learn a valid policy (possibly the optimal one) only if we have enough time. Clearly, this is unacceptable in the animal world, and we all know that many experiences are extremely dangerous; however, this assumption is necessary to prove the convergence of some algorithms. Indeed, sub-optimal policies can sometimes be learned very quickly, but it’s necessary to iterate many times to reach the optimal one.
In real artificial systems, we always stop the learning process after a finite number of iterations, but it’s almost impossible to find valid solutions if some experiences prevent the agent from continuing to interact with the environment. As many tasks have final states (either positive or negative), we assume that the agent can play any number of episodes (somewhat analogous to the epochs of supervised learning), exploiting the experience previously learned.
The second assumption is a little bit more technical and it’s usually known as the Markov property.
The Markov Decision Process
When the agent interacts with the environment, it observes a sequence of states. Even though it might seem like an oxymoron, we assume that each state is stateful. We can explain this concept with a simple example; suppose that you’re filling a tank and every 5 seconds you measure the level. Imagine that at t = 0, the level L = 10 and the water is flowing in. What do you expect at t = 1? Obviously, L > 10. In other words, without external unknown causes, we assume that a state contains the previous history, so that the sequence, even if discretized, represents a continuous evolution where no jumps are allowed.
When an RL task satisfies this property, it’s called a Markov Decision Process (MDP) and it’s very easy to employ simple algorithms to evaluate the actions. Luckily, the majority of natural events can be modeled as MDPs (when you’re walking toward a door, every step in the right direction must decrease the distance), but there are some games that are implicitly stateless.
For example, if you want to employ an RL algorithm to learn how to guess the outcome of a probabilistic sequence of independent events (such as tossing a coin), the result could be dramatically wrong. The reason is clear: any state is independent of the previous ones and every attempt to build up a history is a failure. Therefore, if you observe a sequence of 0, 0, 0, 0… you are not justified in increasing the value of betting on 0 unless, after considering the likelihood of the events, you suppose that the coin is loaded.
However, if there’s no reason to do so, the process isn’t an MDP and every episode (event) is completely independent. All the assumptions that we, either implicitly or explicitly, make are based on this fundamental concept, so pay attention when evaluating new, unusual scenarios because you may discover that the employment of a specific algorithm isn’t theoretically justified.
The environment is the entity within which the agent has to reach its goals. For our purposes, a generic environment is a system that receives an input action, at (we use the index t because this is a natural time process), and outputs a tuple composed of a state, st+1, and a reward, rt+1. These two elements are the only pieces of information provided to the agent to make its next decision. If we are working with an MDP and the sets of possible actions, A, and states, S, are discrete and finite, the problem is a defined finite MDP (in many continuous cases, it’s possible to treat the problem as a finite MDP by discretizing the spaces). If there are final states, the task is called episodic and, in general, the goal is to reach a positive final state in the shortest amount of time or maximize a score. The schema of the cyclic interaction between an agent and an environment is shown in the following diagram:
A very important feature of an environment is its internal nature. It can be either deterministic or stochastic. A deterministic environment is characterized by a function that associates each possible action in a specific state st to a well-defined successor st+1, with a precise reward rt+1:
Conversely, a stochastic environment is characterized by a transition probability between the current state st and a set of possible successors sit+1 given an action at:
If a state si has a transitional probability Tst,st+1i,at=1 ∀ at∈A, the state is defined as absorbing. In general, all ending states in episodic tasks are modeled as absorbing ones, to avoid any further transition. When an episode is not limited to a fixed number of steps, the only criterion to determine its end is to check whether the agent has reached an absorbing state. As we don’t know which state will be the successor, it’s necessary to consider the expected value of all possible rewards considering the initial state st and the action at:
In general, it’s easier to manage stochastic environments because they can be immediately converted into deterministic ones by setting all probabilities to zero except the one corresponding to the actual successor (for example, T(st,st+1i,at)=(0,0,..,0,1,0,…,0)).
In the same way, the expected return can be set equal to rt+1. Knowledge of T(st,st+1i,a)t, as well as E[rt+1i], is necessary to employ some specific algorithms, but it can become problematic when finding a suitable model for the environment requires extremely complex analysis. In all those cases, model-free methods can be employed and, therefore, the environment is considered as a black box, whose output at time t (subsequent to an action performed by the agent at-1), is the only available piece of information for the evaluation of a policy.
We have seen that rewards (sometimes negative rewards are called penalties, but it’s preferable to use a standardized notation) are the only feedback provided by the environment after each action. However, there are two different approaches to the use of rewards. The first one is the strategy of a very short-sighted agent and consists of taking into account only the reward just received. The main problem with this approach is clearly the inability to consider longer sequences that can lead to a very high reward. For example, an agent has to traverse a few states with a negative reward (for example, -0.1), but after them, they arrive at a state with a very positive reward (for example, +5.0). A short-sighted agent couldn’t find out the best policy because it would simply try to avoid the immediate negative rewards. On the other hand, it’s better to suppose that a single reward contains a part of the future rewards that will be obtained following the same policy. This concept can be expressed by introducing a discounted reward, which is defined as:
In the previous expression, we are assuming an infinite horizon with a discount factor, which is a real number bounded between 0 and 1 (not included). When γ=0, the agent is extremely short-sighted, because of Rt = rt+1, but when γ→1, the current reward takes into account the future contributions discounted in a way that is inversely proportional to the time-step. In this way, very close rewards will have a higher weight than very distant ones. If the absolute value of all rewards is limited by a maximum immediate absolute reward rirmax, the previous expression will always be bounded. In fact, considering the properties of a geometric series, we get:
Clearly, the right choice of is a crucial factor in many problems and cannot be easily generalized. As in many other similar cases, I suggest testing different values, picking the one that minimizes the convergence speed while yielding a quasi-optimal policy. Of course, if the tasks are episodic with length T(ei), the discounted reward becomes:
In this article, we introduced one of the most important and fundamental concepts of reinforcement learning, focusing on the mathematical structure of an environment as a Markov Decision Process. We defined the value of a state as the expected future reward considering a sequence discounted by a factor, γ. The focus of Mastering Machine Learning Algorithms, Second Edition by Giuseppe Bonaccorso is to solve complex math behind the algorithms, analysing and dissecting them so readers understand the mathematical mechanics behind these algorithms and what makes them work.
About the author
Giuseppe Bonaccorso is Head of Data Science in a large multinational company. He received his M.Sc.Eng. in Electronics in 2005 from University of Catania, Italy, and continued his studies at University of Rome Tor Vergata, and University of Essex, UK. His main interests include machine/deep learning, reinforcement learning, big data, and bio-inspired adaptive systems. He is author of several publications including Machine Learning Algorithms and Hands-On Unsupervised Learning with Python, published by Packt.