Deep Q-Learning Algorithm in Reinforcement Learning

Deep LearningModelingNeural NetworksReinforcement Learningposted by ODSC Community February 21, 2020

In this article, we will discuss Q-learning in conjunction with neural networks (NNs). This combination has the name deep Q-network (DQN). This...

In this article, we will discuss Q-learning in conjunction with neural networks (NNs). This combination has the name deep Q-network (DQN).

This article is an excerpt from the book Deep Reinforcement Learning Hands-on, Second Edition by Max Lapan. This book provides you with an introduction to the fundamentals of RL, along with the hands-on ability to code intelligent learning agents to perform a range of practical tasks.

The Q-learning method solves the issue of iteration over the full set of states, but still can struggle with situations when the count of the observable set of states is very large. For example, Atari games can have a large variety of different screens, so if we decide to use raw pixels as individual states, we will quickly realize that we have too many states to track and approximate values for.

In some environments, the count of different observable states could be almost infinite. For example, in CartPole, the environment gives us a state that is four floating-point numbers. The number of value combinations is finite (they’re represented as bits), but this number is extremely large. We could create some bins to discretize those values, but this often creates more problems than it solves: we would need to decide what ranges of parameters are important to distinguish as different states and what ranges could be clustered together.

In the case of Atari, one single pixel change doesn’t make much difference, so it’s efficient to treat both images as a single state. However, we still need to distinguish some of the states.

The following image shows two different situations in a game of Pong. We’re playing against the artificial intelligence (AI) opponent by controlling a paddle (our paddle is on the right, whereas our opponent’s is on the left). The objective of the game is to get the bouncing ball past our opponent’s paddle, while preventing the ball from getting past our paddle. We can consider the two situations to be completely different: in the right-hand situation, the ball is close to the opponent, so we can relax and watch. However, the situation on the left is more demanding: assuming that the ball is moving from left to right, the ball is moving toward our side, so we need to move our paddle quickly to avoid losing a point. The situations in Figure 6.2 are just two from the 1070802 possible situations, but we want our agent to act on them differently.

Figure 6.2: The ambiguity of observations in Pong. On the left image, the ball is moving to the right toward our paddle, and on the left, its direction is opposite

As a solution to this problem, we can use a nonlinear representation that maps both the state and action onto a value. In machine learning, this is called a “regression problem.” The concrete way to represent and train such a representation can vary, but as you may have already guessed from this section’s title, using a deep NN is one of the most popular options, especially when dealing with observations represented as screen images. With this in mind, let’s make modifications to the Q-learning algorithm:

1. Initialize Q(s, a) with some initial approximation.
2. By interacting with the environment, obtain the tuple (s, a, r, s’).
3. Calculate loss: if the episode has ended, or (look at the equation below) otherwise

4. Update Q(s, a) using the stochastic gradient descent (SGD) algorithm, by minimizing the loss with respect to the model parameters.
5. Repeat from step 2 until converged.

The preceding algorithm looks simple, but, unfortunately, it won’t work very well. Let’s discuss what could go wrong.

Interaction with the environment

First of all, we need to interact with the environment somehow to receive data to train on. In simple environments, such as FrozenLake, we can act randomly, but is this the best strategy to use? Imagine the game of Pong. What’s the probability of winning a single point by randomly moving the paddle? It’s not zero, but it’s extremely small, which just means that we will need to wait for a very long time for such a rare situation. As an alternative, we can use our Q-function approximation as a source of behavior (as we did before in the value iteration method, when we remembered our experience during testing).

If our representation of Q is good, then the experience that we get from the environment will show the agent relevant data to train on. However, we’re in trouble when our approximation is not perfect (at the beginning of the training, for example). In such a case, our agent can be stuck with bad actions for some states without ever trying to behave differently. This is called the exploration versus exploitation. On the one hand, our agent needs to explore the environment to build a complete picture of transitions and action outcomes. On the other hand, we should use the interaction with the environment efficiently: we shouldn’t waste time by randomly trying actions that we have already tried and have learned outcomes for.

As you can see, random behavior is better at the beginning of the training when our Q approximation is bad, as it gives us more uniformly distributed information about the environment states. As our training progresses, random behavior becomes inefficient, and we want to fall back to our Q approximation to decide how to act.

A method that performs such a mix of two extreme behaviors is known as an epsilon-greedy method, which just means switching between random and Q policy using the probability hyperparameter. By varying, we can select the ratio of random actions. The usual practice is to start with  (100% random actions) and slowly decrease it to some small value, such as 5% or 2% random actions. Using an epsilon-greedy method helps us both to explore the environment in the beginning and stick to a good policy at the end of the training. There are other solutions to the exploration versus exploitation problem, and we will discuss some of them in the third part of the book. This problem is one of the fundamental open questions in RL and an active area of research, which is not even close to being resolved completely.

SGD optimization

The core of our Q-learning procedure is borrowed from supervised learning. Indeed, we are trying to approximate a complex, nonlinear function, Q(s, a), with an NN. To do this, we must calculate targets for this function using the Bellman equation and then pretend that we have a supervised learning problem at hand. That’s okay, but one of the fundamental requirements for SGD optimization is that the training data is independent and identically distributed (frequently abbreviated as i.i.d.).

In our case, data that we are going to use for the SGD update doesn’t fulfill these criteria:

1. Our samples are not independent. Even if we accumulate a large batch of data samples, they will all be very close to each other, as they will belong to the same episode.

2. Distribution of our training data won’t be identical to samples provided by the optimal policy that we want to learn. Data that we have will be a result of some other policy (our current policy, random, or both in the case of epsilon-greedy), but we don’t want to learn how to play randomly: we want an optimal policy with the best reward.

To deal with this nuisance, we usually need to use a large buffer of our past experience and sample training data from it, instead of using our latest experience. This technique is called a replay buffer. The simplest implementation is a buffer of fixed size, with new data added to the end of the buffer so that it pushes the oldest experience out of it. Replay buffer allows us to train on more-or-less independent data, but the data will still be fresh enough to train on samples generated by our recent policy.

Correlation between steps

Another practical issue with the default training procedure is also related to the lack of i.i.d data, but in a slightly different manner. The Bellman equation provides us with the value of Q(s, a) via Q(s’, a’) (this process is called bootstrapping). However, both the states s and s’ have only one step between them. This makes them very similar, and it’s very hard for NNs to distinguish between them. When we perform an update of our NN’s parameters to make Q(s, a) closer to the desired result, we can indirectly alter the value produced for Q(s’, a’) and other states nearby. This can make our training very unstable, like chasing our own tail: when we update Q for state s, then on subsequent states we will discover that Q(s’, a’) becomes worse, but attempts to update it can spoil our Q(s, a) approximation, and so on.

To make training more stable, there is a trick, called target network, by which we keep a copy of our network and use it for the Q(s’, a’) value in the Bellman equation. This network is synchronized with our main network only periodically, for example, once in N steps (where N is usually quite a large hyperparameter, such as 1k or 10k training iterations).

This article, describes deep Q-networks (DQNs), an extension of the basic value-based methods. We checked the Q-learning algorithm on the FrozenLake environment and discussed the approximation of Q-values with NNs, and the extra complications that arise from this approximation. We covered several tricks for DQNs to improve their training stability and convergence, such as an experience replay buffer, target networks, and frame stacking. For a practical hands-on exploration of deep reinforcement learning with an array of exercises and projects to help readers put theory into action, refer to the latest edition of the bestselling guide, Deep Reinforcement Learning Hands-On, Second Edition by Max Lapan.