Introduction to Reinforcement Learning

Machines that can learn from their own experiences

Reinforcement learning is a type of machine learning in which an agent learns to behave in an unknown environment by performing actions and seeing the ensuing results. The agent’s objective is to learn to act in ways that maximizes expected long-term rewards.

When it comes to reinforcement learning, there is no expected outcome—the agent makes the best decision based on its knowledge, for which it is either rewarded or penalized. As such, it is bound to learn from past experiences.

Reinforcement Learning Process

Reinforcement learning systems can be divided into two main components:

  1. Agent (algorithm or model)
  2. Environment (the setting the agent is acting upon)

The environment sends a state to the agent, upon which it performs an action based on observations made. In turn, the environment will send the next state and the respective reward back to the agent.

The agent will then update its knowledge with the reward given by the environment and use it to evaluate its previous action. This loop continues until the environment sends a terminal state, which means that the agent has accomplished all its tasks and has finally achieved the reward.

Policy means state to actions mapping. Here, the algorithm used by the software agent to determine its actions is called its policy. RL policy search refers to the set of actions an agent must take in order to increase the long term rewards. For example, the policy could be a neural network taking observations as inputs and outputting the action to take.

For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up. There are mainly two policy parameters:

  1. Probability that it moves forward is p and (correspondingly) that it moves left or right is 1-p.
  2. Rotation angle is within the range of +r to -r.

The policy involves randomness of both direction and angle selection—it is called a stochastic policy.

One of the other techniques to discover ideal policy parameters in the direction of a greater cumulative reward involves the use of Policy Gradients. This entails learning policy parameters by maximizing rewards along the rising gradient in a relationship involving rewards vs parameters.

For example, in the vacuum cleaner robot example, we could slightly increase the range of r and then analyze whether it has a positive effect on the long-term rewards. If it does, we can continue tweaking the parameters up to the point that it increases the amount of dirt collected. This policy is somewhat similar to the process of training logistic regression models.

Reward Maximization

Reward maximization theory states that an RL agent must be trained in such a way that it takes the best action so that the reward is maximized.

The end goal in the environment depicted above is to eat the maximum amount of meat before being eaten by the tiger. The fox must eat the meat chunks furthest away from the tiger to minimize the chances of being eaten itself. The agent would not eat the chunks closer to the tiger, even though they are larger. This is called discounting.

The discounting of rewards works based on a value called Gamma(𝛾). If the value of Gamma(𝛾) is lower, this means the agent is not going to explore and eat the chunks closer to the opponent, in order to avoid the risk.

Exploitation and Exploration

Exploitation involves using the already known information to heighten the rewards. Exploration, on the other hand, means exploring and capturing more information about the environment.

For example, the fox eats the meat chunks close to itself instead of going closer to the tiger and eating the larger chunks. This is exploitation, whereas if the agent decides to explore the whole environment and attain maximum rewards, that would fall under exploration.

Markov Decision Process

The mathematical approach of mapping a solution in reinforcement learning is called Markov Decision Process (MDP). An MDP provides a mathematical framework to formalize reinforcement learning in sequential decision making where outcomes are partly random and partly under the control of the decision maker.The following parameters are used to attain a solution:

  • A is the set of actions agent can choose to take
  • S is the set of states
  • R is the Reward accumulated by the actions of the agent
  • π is the Policy
  • V is the value.

Markov Property

  • Transition: Moving from one state to another is called a transition.
  • Transition Probability: The probability that the agent will move from one state to another is called the transition probability.
  • The Markov Property states that:

Mathematically, we can express this statement as :

S[t] denotes the current state of the agent and s[t+1] denotes the next state. What this equation represents is that the transition from state S[t] to S[t+1] is entirely independent of the past.

  • State Transition Probability: Having known about transition probabilities, we can define the state transition probability as follows: For Markov State from S[t] to S[t+1] i.e. any other successor state , the state transition probability is given by:

For example, in a chess game there are finite Markovian states, as the current state and action on the board is independent of the past states. On the other hand, a share market can be called non-Markovian, as the share value on a particular day depends on the pattern/sequence of events in the past.

In the above-stated problem, we have:

  • Set of states: high and low
  • Actions to traverse from one state to another: wait, search, rescue and recharge
  • Reward(r) is the cost represented by each edge
  • Policy is the path taken by the agent

For example, the MDP of a recycling robot that collects empty cans from an environment is represented in above figure. It has two states and possible discrete actions at each step. It can search for cans for a fixed amount of time, remain stationary, wait for somebody to bring the can, or recharge the battery if it’s low.

At a high state, the robot searches for cans and might not change the state with probability α, or the energy level may drop too low (by probability 1-α). The reward in both cases is the cans found. In both states, the robot can wait so that it does not drain the battery, with some associated reward (r_wait).

Searching at a low state might deplete the battery and cause a rescue action to be taken involving a negative reward. Another dynamic in this problem is to recharge the battery for a transition from a low state to a high state.

Understanding Q-Learning with an example

Q-Learning is a value-based reinforcement learning algorithm used to find the optimal action-selection policy using a Q function.

Let’s suppose there’s a building with 5 rooms numbered 0–4. An agent is placed in any one of the rooms and the goal is to reach outside the building (room 5). Doors 1 and 4 lead into the building from room 5. A reward value is associated with each door.

The rooms are represented on a graph, with each room acting as a node and each door as a link. Doors that directly connect to room 5 are given higher reward values than others.

A reward matrix R is constructed from the state diagram and reward values. The reward matrix for above state diagram is as shown below:

Another matrix called the Q matrix is constructed that depicts the memory of what the agent has learned up until now from its experiences. A Q-Table is simply a lookup table where we calculate the maximum expected future rewards for action at each state.

The rows of the Q-matrix represent the current state of the agent, while the columns represent the possible actions leading to the next state. The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). The formula for calculating the Q matrix is as follows:

First, the Q matrix is initialized as a zero matrix. Taking the example of the room problem described above, Q (1,5) is calculated by taking Gamma(𝛾) value 0.8, denoting more exploration.

Q-Learning Algorithm

  1. Set the gamma parameter and environment rewards in the reward matrix (R)
  2. Initialize matrix Q to zero
  3. Select a random initial state
  4. Set initial state=current state
  5. Select one among all possible actions for the current state
  6. Using this possible action, consider going to the next state
  7. Get the maximum Q value for this next state based on all possible actions
  8. Compute Q(state, action) from the above formula
  9. Repeat above steps until current state=goal state


Congratulations on sticking through until the end!

In this post, we’ve covered the various fundamental concepts of reinforcement learning and the key terms associated with it. In future posts, I’ll work through a quick implementation of the concepts in Python. Please do share your experiences about reinforcement learning in the comments below.

Avatar photo


Our team has been at the forefront of Artificial Intelligence and Machine Learning research for more than 15 years and we're using our collective intelligence to help others learn, understand and grow using these new technologies in ethical and sustainable ways.

Comments 0 Responses

Leave a Reply

Your email address will not be published. Required fields are marked *