In this project, I applied reinforcement learning techniques to a self-driving agent in a simplified world so that it can reliably reach its destinations within the allotted time. First, I investigated the environment the agent operates in by constructing a very basic driving implementation. Once that driving agent could operate within the environment, I identified each possible state the agent could be in, considering such things as the traffic light and oncoming traffic at each intersection. With the states identified, I implemented a Q-Learning algorithm to guide the self-driving agent towards its destination within the allotted time. Finally, I improved upon the Q-Learning algorithm by searching for the best configuration of learning and exploration factors, so that the self-driving agent reaches its destinations with consistently positive results.
The smartcab operates in an idealized, grid-like city (similar to New York City), with roads running in the North-South and East-West directions. Other vehicles will be present on the road, but there are no pedestrians to be concerned with. At each intersection there is a traffic light that allows traffic in either the North-South direction or the East-West direction, and U.S. right-of-way rules apply.
Assume that the smartcab is assigned a route plan based on the passenger's starting location and destination. The route is split at each intersection into waypoints, and you may assume that the smartcab, at any instant, is at some intersection in the world. Therefore, the next waypoint to the destination, assuming the destination has not already been reached, is one intersection away in one direction (North, South, East, or West). The smartcab has only an egocentric view of the intersection it is at: it can determine the state of the traffic light for its direction of movement, and whether there is a vehicle at the intersection for each of the oncoming directions. At each step, the smartcab may either idle at the intersection or drive to the next intersection to the left, right, or ahead of it. Finally, each trip has an allotted time to reach the destination, which decreases with each action taken (the passengers want to get there quickly). If the allotted time reaches zero before the destination is reached, the trip has failed.
The smartcab receives a reward for each successfully completed trip, and also receives a smaller reward for each action it executes successfully that obeys traffic rules. The smartcab receives a small penalty for any incorrect action, and a larger penalty for any action that violates traffic rules or causes an accident with another vehicle. Based on the rewards and penalties the smartcab receives, the self-driving agent implementation should learn an optimal policy for driving on the city roads while obeying traffic rules, avoiding accidents, and reaching passengers' destinations in the allotted time.
The basic driving agent is not concerned with any sort of driving policy; it simply chooses a random action from the set of possible actions (None, forward, left, right) at each intersection. It makes no use of information such as the next waypoint, the traffic light, or the time remaining before the allotted deadline.
With this random movement across the map, the driving agent does eventually arrive at the destination, but rarely within the time limit and only after a very large number of moves (e.g., 7551 moves over 100 trials).
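For reference, here is a minimal sketch of such a random agent's update step. It assumes the usual simulator interface (planner.next_waypoint, env.sense, env.get_deadline, env.act); the names are illustrative, not necessarily the exact ones used in agent1.py.
import random
# A basic agent step that senses its surroundings but ignores everything it senses.
def update(self, t):
    self.next_waypoint = self.planner.next_waypoint()  # suggested direction (unused)
    inputs = self.env.sense(self)                      # light, oncoming, left, right (unused)
    deadline = self.env.get_deadline(self)             # remaining moves (unused)
    # Choose an action uniformly at random, regardless of the inputs above.
    action = random.choice([None, 'forward', 'left', 'right'])
    # Execute the action; the environment returns a reward or penalty.
    reward = self.env.act(self, action)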
# Import the agent1.py log file to show the number of times the agent successfully reached the destination out of 100 trials
#with enforce_deadline=False
import pandas as pd
pd_result = pd.read_csv("./s_submit/smartcab1/log/agent1/1111_final_enfocedeadline_false/success_total_agent1.csv",
names=['success','total','max_trial','reward','total_reward','penalties','moves','penalties_vs_moves'])
pd_result
print pd_result['penalties_vs_moves'].mean()
import numpy
import matplotlib.pyplot as plt
import matplotlib
from IPython.display import display # Allows the use of display() for DataFrames
# Show matplotlib plots inline (nicely formatted in the notebook)
%matplotlib inline
matplotlib.style.use('ggplot')
pd_success = pd_result['success']
ax = pd_success.plot( title ="trial vs. success ",figsize=(10,6),legend=True, fontsize=12)
ax.set_xlabel("trial number",fontsize=12)
ax.set_ylabel("success count",fontsize=12)
plt.show()
Several segments parallel to the x-axis (for example around trial numbers 10 to 20, 50 to 60, and 60 to 70) appear in the chart above, which means there is no learning from previous actions and rewards.
In the following charts, the total rewards are -545, -440, and -380 respectively, the penalties are 1836, 1752, and 1672, and the total number of moves over 100 trials is 2869, 2783, and 2648. I will compare these results with those of the Q-Learning agent to demonstrate the superiority of the Q-Learning algorithm in terms of success ratio.
pd_result = pd.read_csv("./s_submit/smartcab1/log/agent1/1111_final_enfocedeadline_true/success_total_agent1.csv",
names=['success','total','max_trial','reward','total_reward','penalties','moves','penalties_vs_moves'])
pd_result.tail()
pd_result = pd.read_csv("./s_submit/smartcab1/log/agent1/1111_final_enfocedeadline_true2/success_total_agent1.csv",
names=['success','total','max_trial','reward','total_reward','penalties','moves','penalties_vs_moves'])
pd_result.tail()
pd_result = pd.read_csv("./s_submit/smartcab1/log/agent1/1111_final_enfocedeadline_true3/success_total_agent1.csv",
names=['success','total','max_trial','reward','total_reward','penalties','moves','penalties_vs_moves'])
pd_result.tail()
As shown in the table below, the agent's actions are not based on previous actions and previous rewards; in other words, it is in an "exploring" mode rather than an "exploiting" mode. It does not reuse previously rewarded actions but keeps taking new actions regardless of them. For example, at steps 19 and 21, where the light is red, the agent chooses the action 'forward', which produces poor rewards because it does not know the traffic rules. It also gets stuck at the same position many times, since it chooses the action None regardless of the traffic light or oncoming traffic.
pd_result = pd.read_csv("./s_submit/smartcab1/log/agent1/1111_final_enfocedeadline_false/forcsv.csv",
names=['start[0]','start[1]','dest[0]','dest[1]','inputs_oncoming','inputs_left','inputs_right',
'light', 'action','heading[0]','heading[1]','location[0]','location[1]','reward'])
pd_result
from collections import Counter
import string
doc1 = pd_result['action']
action_count = Counter(doc1)
action_count
pd_light_action = pd_result[['light','action']]
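# Flag the steps where the agent drove 'forward' on a red light (a traffic-rule violation).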
pd_light_action['bad_action']=pd_light_action.apply(lambda x: x['light']=='red' and x['action']=='forward', axis = 1)
print "Number of action 'forward' taken in spite of light' red is ", len(pd_light_action[pd_light_action['bad_action']])
print "Percentage : ", len(pd_light_action[pd_light_action['bad_action']])/float(len(pd_light_action['bad_action']))*100,"%"
result_none =pd_result.loc[pd_result['action']=='None']['action']
pd_result=pd.DataFrame(result_none)
pd_result.index.name='step'
pd_result
For the Q-Learning algorithm that follows, the following set of states is considered to model the driving agent.
Since there are only 4 cars (dummy agents) in the simulation environment, the probability of colliding with another car on the right, left, or oncoming direction is extremely low, and even when collisions happen there is no penalty for them. So the oncoming, left, and right inputs have little meaning when modeling the driving agent. The state used in this simulation is built from the next waypoint, which takes 4 values (None, Left, Right, Forward), and the traffic light, which takes 2 values (green, red), which makes 4 + 2 = 6 state values. These are coupled with the 4 possible actions (None, Left, Right, Forward), giving 6 x 4 = 24 state-action pairs that must be trained to reach the destination. There is no need to include the deadline when training the Q-Learning agent, even though it is used for the statistical analysis of the agent's movement, because such a redundant state variable would make training much longer and lower performance. (The distance from the start position to the destination can be 1 to 12 steps, so the deadline can range from 5 (1 x 5) to 60 (5 x 12), a span of 56 values (60 - 5 + 1). Including it would give 24 x 56 = 1344 state-action pairs to train.)
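As an illustration only (the exact encoding in agent.py may differ), the state could be built from the planner's next waypoint and the sensed traffic light, ignoring the oncoming/left/right inputs and the deadline for the reasons given above:
def build_state(self):
    # Sense the intersection and ask the planner for the suggested direction.
    inputs = self.env.sense(self)                 # {'light', 'oncoming', 'left', 'right'}
    next_waypoint = self.planner.next_waypoint()  # None, 'forward', 'left', or 'right'
    # Keep only next_waypoint and light; the other inputs and the deadline are
    # deliberately excluded to keep the number of state-action pairs small.
    return (next_waypoint, inputs['light'])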
The Q-Learning agent picks the best available action for the current state based on its Q-values, returning that best action instead of a random choice among None, Left, Right, and Forward. This is achieved by initializing and updating the Q-values in a Q dictionary: each action generates rewards or penalties, which are taken into consideration when the Q dictionary is updated.
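The sketch below shows one common way to implement this selection: epsilon-greedy over the Q dictionary. The helpers get_Qmax and choose_action, and the attributes initial_q and epsilon, are illustrative assumptions rather than the exact names used in agent.py.
import random
def get_Qmax(self, state):
    # Return (best Q-value, best action) for a state; unseen pairs fall back to the initial Q-value.
    actions = [None, 'forward', 'left', 'right']
    best_action = max(actions, key=lambda a: self.Qdictionary.get((state, a), self.initial_q))
    return self.Qdictionary.get((state, best_action), self.initial_q), best_action
def choose_action(self, state):
    # With probability epsilon explore a random action; otherwise exploit the best known one.
    if random.random() < self.epsilon:
        return random.choice([None, 'forward', 'left', 'right'])
    return self.get_Qmax(state)[1]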
With the Q-Learning agent, the smartcab reaches the destination with fewer moves, fewer penalties, better rewards, and a better success rate, since it learns the traffic-light rules and follows the planner's next_waypoint. The Q-value update the agent applies over the states described above is shown below:
def update_Q(self, previous_state, previous_action, previous_reward, state):
    # Standard Q-learning update: blend the old Q-value with the observed reward
    # plus the discounted best Q-value of the resulting state.
    self.Qdictionary[(previous_state, previous_action)] = \
        (1 - self.learning_rate) * self.Qdictionary[(previous_state, previous_action)] + \
        self.learning_rate * (previous_reward + self.discount_factor * self.get_Qmax(state)[0])
By comparison, the model with the same parameters but an initial Q-value of 0 reaches the destination 81, 94, and 98 times out of 100, respectively.
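A minimal sketch of how such an initial Q-value could be applied, creating entries lazily on first access (get_Q and initial_q are illustrative names, not necessarily those used in agent.py):
def get_Q(self, state, action, initial_q=0.0):
    # Create the entry the first time a (state, action) pair is seen, so unseen
    # pairs start at the chosen initial Q-value (0 in the comparison above).
    if (state, action) not in self.Qdictionary:
        self.Qdictionary[(state, action)] = initial_q
    return self.Qdictionary[(state, action)]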
pd_result = pd.read_csv("./s_submit/logs/LearningAgent_update1111_1.csv",
names=['deadline','light','oncoming','right','left','action','reward'])
pd_result
result_r = pd_result['reward']
ax = result_r.plot(title ="rewards vs. moves ",figsize=(10,6),legend=True, fontsize=12)
ax.set_xlabel("move",fontsize=12)
ax.set_ylabel("rewards",fontsize=12)
plt.show()
pd_light_action = pd_result[['light','action']]
pd_light_action['bad_action']=pd_light_action.apply(lambda x: x['light']=='red' and x['action']=='forward', axis = 1)
print "Number of action 'forward' taken in spite of light' red is ", len(pd_light_action[pd_light_action['bad_action']])
print "Percentage : ", len(pd_light_action[pd_light_action['bad_action']])/float(len(pd_light_action['bad_action']))*100,"%"
pd_result_q = pd.read_csv("./s_submit/logs/success_total1111_3.csv",
names=['success','total','max_trial','reward','total_reward','penalties','moves','penalties_vs_moves'])
pd_result_q
print pd_result_q['penalties_vs_moves'].mean()
pd_success = pd_result_q['success']
ax = pd_success.plot( title ="trial vs. success ",figsize=(10,6),legend=True, fontsize=12)
ax.set_xlabel("trial number",fontsize=12)
ax.set_ylabel("success count",fontsize=12)
plt.show()
In agent.py, I implemented a three-level for loop over the learning rate in [0.7, 0.8, 0.9] (favoring more recent information), the discount factor in [0.1, 0.2, 0.3, 0.33, 0.4, 0.44] (focusing more on current rewards), and epsilon in [0.0, 0.1, 0.2, 0.3, 0.4] (controlling exploration). After an exhaustive search for the optimal parameters, I found two optimal parameter sets: (learning rate 0.7, discount factor 0.1, epsilon 0.1) and (0.9, 0.33, 0.1). Each reaches the destination 97 times out of 100, the best observed frequency. Choosing between them by a coin flip, I use the (0.7, 0.1, 0.1) set for the learning agent, which performs far better than the basic agent, whose success counts were 17, 14, and 20 out of 100 trials.
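A minimal sketch of this exhaustive search is shown below; run_trials is a hypothetical helper standing in for a 100-trial simulator run that returns the success count and total reward.
# Grid search over the three Q-learning parameters described above.
best = None
for learning_rate in [0.7, 0.8, 0.9]:
    for discount_factor in [0.1, 0.2, 0.3, 0.33, 0.4, 0.44]:
        for epsilon in [0.0, 0.1, 0.2, 0.3, 0.4]:
            # run_trials (hypothetical) runs 100 trials and reports (successes, total reward).
            success, total_reward = run_trials(learning_rate, discount_factor, epsilon, n_trials=100)
            if best is None or success > best[0]:
                best = (success, learning_rate, discount_factor, epsilon)
print "Best (success, learning_rate, discount_factor, epsilon):", best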
pd_optimal = pd.read_csv("./s_submit/logs/optimal_result1111.csv",
names=['learning_rate','discount_factor','epsilon','success','total_reward'])
pd_optimal['learning_rate'] = pd_optimal['learning_rate'].str.strip('[')
pd_optimal['total_reward'] = pd_optimal['total_reward'].str.strip(']')
pd_optimal_p =pd.DataFrame(pd_optimal)
pd_optimal_p.index.name='step'
pd_optimal_p
pd_optimal_p.sort_values(['success'], ascending=False).head()
pd_result_q = pd.read_csv("./s_submit/logs/success_total1111_1.csv",
names=['success','total','max_trial','reward','total_reward','penalties','moves','penalties_vs_moves'])
pd_result_q['reward_vs_move'] = pd_result_q['total_reward']/pd_result_q['moves']
pd_result_q.tail()
pd_result_q_pm = pd_result_q['moves']
ax = pd_result_q_pm.plot( title ='total moves',figsize=(10,6),legend=True, fontsize=12)
ax.set_xlabel("trial",fontsize=12)
ax.set_ylabel("total moves",fontsize=12)
plt.show()
pd_result_q_rm = pd_result_q['reward_vs_move']
ax = pd_result_q_rm.plot( title ='total reward_vs_total move ratio',figsize=(10,6),legend=True, fontsize=12)
ax.set_xlabel("trial",fontsize=12)
ax.set_ylabel("total reward_vs_ total move ratio",fontsize=12)
plt.show()
pd_result_q_pm = pd_result_q['penalties_vs_moves']
ax = pd_result_q_pm.plot( title ='total penalties_vs_total movel ratio',figsize=(10,6),legend=True, fontsize=12)
ax.set_xlabel("trial",fontsize=12)
ax.set_ylabel("total penalties_vs_total move ratio",fontsize=12)
plt.show()
Thus, this Q-Learning agent has learned an optimal policy for reaching the destination quickly, making the correct move according to the traffic rules with a minimal number of steps.
pd_result_q = pd.read_csv("./s_submit/logs/success_total1111_2.csv",
names=['success','total','max_trial','reward','total_reward','penalties','moves','penalties_vs_moves'])
pd_result_q.tail()
pd_result_q = pd.read_csv("./s_submit/logs/success_total1111_3.csv",
names=['success','total','max_trial','reward','total_reward','penalties','moves','penalties_vs_moves'])
pd_result_q.tail()