Reinforcement Learning: The Next Frontier

Deep learning has reached many milestones in recent years: convolutional neural networks have surpassed human performance in tasks like object detection and image classification, and transformers are delivering impressive results in natural language tasks. While these are outstanding achievements, these methods require large amounts of labeled training data, which is often difficult to obtain. Moreover, humans do not learn in this manner.

Reinforcement learning is a learning paradigm in which an AI agent learns through experience. It interacts with an environment and receives rewards from it as feedback. Training an RL agent does not require a labeled dataset. Thought leaders in AI perceive reinforcement learning as a key step towards Artificial General Intelligence. A recent paper entitled "Reward is Enough," by DeepMind, a subsidiary of Alphabet (Google), hypothesizes that intelligence and the abilities that derive from it, like perception, social intelligence, and generalization, can be understood as the maximization of reward. This article introduces the reader to the reinforcement learning paradigm and shows how it can be used to train an RL agent to land a spaceship.

Reinforcement Learning Paradigm

The field of deep learning is inspired by natural intelligence, and reinforcement learning is no exception. Consider a baby learning to walk, a bird learning to fly, or an RL agent trying to land a spaceship. They all have these three things in common:

  • Trial and error: Each agent (baby, bird, or RL agent) makes many unsuccessful attempts, learning from each failure.
  • Goal: The agent has a specific goal (to stand, fly, or land the spaceship).
  • Interaction with the environment: There is no manual, no teacher, no training sample from which the agent can learn. The only feedback comes from the immediate environment, in the form of a reward or a punishment.

Training an RL agent is like teaching a pet: we cannot speak the same language, but when the pet performs the action we asked for, we 'reinforce' it by rewarding it with treats. So when you throw a ball, your dog fetches it back to you, and you give it a doggy biscuit, you are, in essence, doing reinforcement learning.

To say the same thing in RL language: there is an environment consisting of you, the dog, the ball, and the ground. The agent is the dog. The goal is fetching the ball. And the reward is the treat you give when the dog brings the ball back to you. The final aim of the agent, the dog, is to find a policy $\pi$ that will maximize its rewards:

Reinforcement Learning diagram example

To write programs that can learn via reinforcement learning, we need a mathematical representation of this setup. We obtain one by defining three terms: states, actions, and rewards. The state space, S, is the set of all possible states, s, that can exist in our universe. The action space, A, is the set of all possible actions the agent can take at a particular time step. And the reward, R, is the feedback the agent receives from the environment.

Additionally, we need a function that describes how state s changes after the agent performs action a; it is normally called the transition function. If you are building your own environment, you are responsible for defining all of these. To facilitate learning, however, a few open-source frameworks/libraries provide pre-made environments. We will be using OpenAI's Gym library. So get ready to launch the spaceship.
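Before moving to Gym, it may help to see these pieces written out by hand. The following is a toy sketch of a tiny deterministic world; the state names, actions, and rewards are invented for illustration and have nothing to do with the Lunar Lander:

```python
# A toy 3-state deterministic MDP (illustrative only).
# States: 'start', 'mid', 'goal'; actions: 'left', 'right'.
STATES = ['start', 'mid', 'goal']
ACTIONS = ['left', 'right']

# Transition function: (state, action) -> next state
TRANSITIONS = {
    ('start', 'right'): 'mid',
    ('start', 'left'):  'start',
    ('mid',   'right'): 'goal',
    ('mid',   'left'):  'start',
}

def reward(state, action):
    # Reward function: reaching the goal pays +1, everything else 0
    return 1.0 if TRANSITIONS.get((state, action)) == 'goal' else 0.0

def step(state, action):
    # One interaction step: apply the transition and collect the reward,
    # mirroring what Gym's step method will do for us later
    next_state = TRANSITIONS.get((state, action), state)
    return next_state, reward(state, action)
```

With this in hand, an episode is just repeated calls to `step`, e.g. `step('start', 'right')` returns `('mid', 0.0)`. Pre-made environments like Gym's simply implement richer versions of these same functions.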

Lunar Lander

The Lunar Lander environment simulates a scenario in which a spaceship (the lander) needs to land at a specific location under low-gravity conditions. The environment implements a well-defined physics engine that takes care of the state transitions. The goal of the game is to direct the lander to the landing pad as softly and fuel-efficiently as possible. The state is defined by the following eight variables:

  • $x$: the horizontal coordinate of the spaceship
  • $y$: the vertical coordinate of the spaceship
  • $v_x$: the horizontal velocity of the spaceship
  • $v_y$: the vertical velocity of the spaceship
  • $\theta$: the orientation in space
  • $v_{\theta}$: the angular velocity of the spaceship
  • A boolean with value True if the left leg of the spaceship touches the ground
  • A boolean with value True if the right leg of the spaceship touches the ground

As you may have guessed, the first six are continuous variables; the action space, however, is discrete. There are four possible actions:

  • Do nothing
  • Fire left orientation engine
  • Fire right orientation engine
  • Fire main engine

Firing the left and right engines results in torque on the lander, which causes it to rotate, and makes stabilizing difficult.

Random agent flying the spaceship

Thankfully, this spaceship cannot blow up, so let us see how a random agent flies the ship. We will use OpenAI Gym for the task. The Gym module contains a large number of environments, and it also supports building custom environments.

For the present task, we will use three methods available in the Gym module:

  • make: instantiates the specified environment, LunarLander-v2.
  • reset: resets the environment to its starting condition.
  • step: performs the specified action in the environment.

In the code below, the agent chooses a random action using the sample method. It does so repeatedly until the task is done, that is, until the spaceship touches the ground:

import gym

env = gym.make('LunarLander-v2')
r = 0.0
obs = env.reset()
while True:
    # Sample a random action and apply it to the environment
    obs, rewards, dones, info = env.step(env.action_space.sample())
    r += rewards
    if dones:
        break
print(r)  # total reward accumulated over the episode

No matter how many times I tried, the agent ended up with a negative total reward. You can also try running the code; it takes only a few seconds. Here is the link to the Jupyter Notebook, which you can open in Colab and run directly. The notebook also has code for visualizations, which takes a little longer to run.

Remember, a negative reward is equivalent to punishment.

Deep Q Network Agent

Now, while a random agent manages to get the spaceship to the ground, it does not use fuel optimally, and the landing may involve sharp bumps. What can we do? We can train an agent that, through many trials and errors, learns the best possible action for any given situation.

It is very much like building a cheat sheet; in fact, in the base algorithm, Q-learning, the agent maintains a table (the Q-table) of Q values for all possible state-action pairs. The Q-function (also called the state-action value function) of a policy $\pi$, $Q^{\pi}(s, a)$, measures the expected return, or discounted sum of rewards, obtained from state $s$ by taking action $a$ first and following policy $\pi$ thereafter. The optimal Q-function $Q^*(s, a)$ is defined as the maximum return that can be obtained starting from observation $s$, taking action $a$, and following the optimal policy thereafter. The table entries are updated as the agent learns, using the Bellman equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
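The Q-table update can be sketched in a few lines of plain Python. The state names, learning rate, and discount factor below are illustrative assumptions, not values from the Lunar Lander setup:

```python
from collections import defaultdict

# Q-table: maps (state, action) -> estimated value, defaulting to 0.0
Q = defaultdict(float)

alpha = 0.1   # learning rate (assumed value)
gamma = 0.99  # discount factor (assumed value)

def q_update(s, a, r, s_next, actions):
    """One Bellman update on a single (s, a, r, s') transition."""
    # Best achievable estimate from the next state
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    # Nudge Q(s, a) towards the bootstrapped target r + gamma * best_next
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example: a transition out of hypothetical state 's0' that earned reward 1.0
q_update('s0', 'right', 1.0, 's1', actions=['left', 'right'])
```

Starting from an all-zero table, this single update moves Q[('s0', 'right')] from 0.0 to 0.1; repeated over many episodes, the table converges towards the optimal Q-function.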

Now, maintaining a table is not possible when the state space and action space grow large. In 2015, Google DeepMind proposed an algorithm, DQN (Deep Q Network), which uses a deep neural network to estimate the Q value for a given state. To do this, DQN minimizes the following loss at each time step $i$:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim \rho} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right], \qquad y_i = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})$$
Here, $y_i$ is called the temporal difference (TD) target, and $y_i - Q$ is called the TD error. $\rho$ represents the behavior distribution: the distribution over transitions $\{s, a, r, s'\}$ collected from the environment.
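To make the TD target and TD error concrete, here is a tiny numeric sketch in plain Python; the reward, discount factor, and Q estimates are invented for illustration:

```python
# Toy numbers for a single transition (s, a, r, s'); all values are assumptions
gamma = 0.99                      # discount factor
r = 1.0                           # reward observed for taking action a in state s
q_next = [0.5, 2.0, -1.0, 0.0]    # network estimates Q(s', a') for the 4 actions
q_sa = 1.2                        # current estimate Q(s, a)

y = r + gamma * max(q_next)       # TD target y_i = 1.0 + 0.99 * 2.0 = 2.98
td_error = y - q_sa               # TD error y_i - Q(s, a) = 1.78
loss = td_error ** 2              # squared-error term inside the expectation
```

The network's weights are then adjusted by gradient descent to shrink this squared TD error, averaged over transitions sampled from $\rho$.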

Let us train the DQN agent available in stable-baselines to land the spaceship. We will use a simple multilayer perceptron model (MlpPolicy) as the Q-network. Using the stable-baselines library is relatively easy; in just two lines, you can train the RL agent:

from stable_baselines import DQN
from stable_baselines.deepq.policies import MlpPolicy

model = DQN(MlpPolicy, env, verbose=1, prioritized_replay=True)
model.learn(total_timesteps=100000)  # training length is a tunable choice

If you now test the agent, you will see that after a few thousand trial-and-error episodes, it manages to achieve a positive reward.

Final Words on Reinforcement Learning

It is extraordinary: without being told how, the agent managed to learn from experience alone. With AlphaGo's win over Lee Sedol, RL algorithms established their success in the arena of games. But that is not all: by defining the environment and the right form of reward function, one can train an RL agent for many other tasks, such as making business decisions, predicting stock prices, and a lot more.

If you are interested in building your own RL agents from scratch and designing your own Gym environments, check out the 4-hour live training session I am conducting on July 20th. You can also get introduced to the nitty-gritty of TensorFlow, which we will use to build the RL agents, and the basics of deep learning models via my book Deep Learning with TensorFlow 2 and Keras.

P.S. Stable-baselines does not support TensorFlow 2 as of now, so you will see some deprecation warnings in the notebook; ignore them.

About the author/Ai+ Speaker on Reinforcement Learning:

Amita Kapoor is the author of best-selling books in the field of Artificial Intelligence and Deep Learning. She mentors students at different online platforms such as Udacity and Coursera and is a research and tech advisor to organizations like DeepSight AI Labs and MarkTechPost. She started her academic career in the Department of Electronics, SRCASW, the University of Delhi, where she is an Associate Professor. She has over 20 years of experience in actively researching and teaching neural networks and artificial intelligence at the university level.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.