This is an implementation using the concepts of Q-Learning, which I covered in a previous blog post providing a high-level overview of reinforcement learning (RL).
To help demonstrate these concepts, I’ll be covering how to automate an agent to find the shortest route from its source to a particular destination, recognizing the environment and obstacles, thus learning from its experiences. A typical example would be a robot exploring the environment and finding the optimal way to the goal.
In this walk-through, we’ll use Q-learning to find the shortest path between two nodes. This approach requires constant trial and error, as the agent collects data about its surroundings and figures out how to accomplish its goal, which is to maximize its reward in the fewest possible steps.
1. Building the environment
As a first step, we need to build an environment with nodes and edges representing the directions a bot can take. We can easily create complex graphs and visualize everything with networkx graphs, which is a Python package used to study the structure and dynamics of infrastructure networks.
import numpy as np import pylab as plt import networkx as nx # mapping denote the corresponding edges between nodes points_list = [(0,1), (1,5), (5,6), (5,4), (1,2), (2,3), (2,7)] goal = 3 G=nx.Graph() G.add_edges_from(points_list) pos = nx.spring_layout(G) nx.draw_networkx_nodes(G,pos) nx.draw_networkx_edges(G,pos) nx.draw_networkx_labels(G,pos) plt.show()
2. Creating a Reward Matrix
A reward is an integral part of the environmental feedback. When an agent interacts with the environment, the changes in the state are reflected through its actions by the reward signal .
The goal, in general, is to solve a given task achieving the maximum reward possible. That is why many algorithms have a very small negative reward for each action the agent takes, to animate it to solve the task as fast as possible.
Hence, we can assign a NumPy matrix denoting the rewards to be -1 by default. Only the viable paths will be set to 0, while the maximum reward-giving path leading directly to the goal will be set to 200. The reward matrix helps the bot take the best action that will eventually lead it to its goal.
# assign zeros to paths and 200 to goal-reaching point for point in points_list: print(point) if point == goal: R[point] = 200 else: R[point] = 0 if point == goal: R[point[::-1]] = 200 else: # reverse of point R[point[::-1]]= 0 # add goal point round trip R[goal,goal]= 200
3. Framing the Q matrix
Next, we’ll add a similar Q matrix to the brain of our agent, representing the memory of what the agent has learned through its experience. The rows of the Q matrix represent the current state of the agent, and the columns represent the possible actions leading to the next state (the links between the nodes).
The Gamma parameter shown in the code snippet below also has some importance in the overall working of the agent. If Gamma is set closer to zero, the agent will be more inclined towards immediate rewards, whereas if set closer to one, the agent will consider future rewards with greater weight, and while willing to delay the reward.
#This equation, known as the Bellman equation, tells us that the maximum future reward. Q[current_state, action] = R[current_state, action] + gamma * max_value print('max_value', R[current_state, action] + gamma * max_value) if (np.max(Q) > 0): return(np.sum(Q/np.max(Q)*100)) else: return (0)
4. Training the model for certain epochs
In each training session, the agent explores the environment (with rewards represented by the R matrix defined earlier), and receives the reward (if any on its way) until it finally terminates by reaching the goal state. The purpose of training is to enhance the knowledge base of our agent, represented by the Q matrix. Generally speaking, more training = a more optimized Q matrix.
scores =  for i in range(700): current_state = np.random.randint(0, int(Q.shape)) available_act = available_actions(current_state) action = sample_next_action(available_act) score = update(current_state,action,gamma) scores.append(score)
5. Testing the model
To test whether the model has learned to find the optimal path, we can always feed it with a new initial state and see the optimal path it generates from its knowledge base.
#initial state from where the bot starts current_state = 7 steps = [current_state] while current_state != 3: next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,])) if next_step_index.shape > 1: next_step_index = int(np.random.choice(next_step_index, size = 1)) else: next_step_index = int(next_step_index) steps.append(next_step_index) current_state = next_step_index print("Most efficient path:") print(steps)
The above implementation can be molded a bit to solve other complex practical problems such as:
- Games such as Pac-Man or AlphaGo, in which the CPU learns to play by sensing the environment.
- Robotic applications such as path finders, part collectors in factories, etc.
- Self-driving cars where the route that needs to be taken is the most optimum and involves the least amount of traffic (more reward).
The complete code for this problem, with all method declarations, can be found at my GitHub repo linked below:
Thanks for sticking all the way through until the end!