🧠 Exercise 1 - Environment Representation
The environment dynamics are split into a deterministic movement component attempt_move(...) and a stochastic component stoch_action(...); these are combined into a single step function, perform_action(...), implemented below as apply_move(...). The snippets in these notes assume import random and import math at module level, along with the environment constants REWARDS, EXIT_STATE and ACTIONS.
def apply_move(self, s, a):
    # handle special cases
    if s in REWARDS.keys():  # is this state terminal? - environment-specific
        # go to the exit state
        next_state = EXIT_STATE
        reward = REWARDS[s]
        return next_state, reward
    elif s == EXIT_STATE:  # not required for MDP problems, but useful in this representation
        # go to a new random state - this effectively starts a new episode
        next_state = random.choice(self.states)
        reward = 0
        return next_state, reward

    # choose the true action stochastically
    r = random.random()  # uniform between 0 and 1
    cumulative_prob = 0
    action = None
    for k, v in self.stoch_action(a).items():
        cumulative_prob += v
        if r < cumulative_prob:
            action = k
            break

    # apply the true action
    next_state = self.attempt_move(s, action)
    reward = 0
    return next_state, reward
The elif s == EXIT_STATE branch marks the end of an episode; jumping from the exit state to a random state effectively starts a new one.
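For context, here is a minimal sketch of the deterministic and stochastic components that apply_move(...) relies on, written for a simple gridworld. The grid layout, slip probabilities and attribute names below are illustrative assumptions, not the exercise's actual environment.

import random

UP, DOWN, LEFT, RIGHT = 'U', 'D', 'L', 'R'
ACTIONS = [UP, DOWN, LEFT, RIGHT]
EXIT_STATE = (-1, -1)
REWARDS = {(3, 1): -100, (3, 2): 1}  # terminal states and their rewards (assumed)

class Grid:
    def __init__(self, x_size=4, y_size=3, discount=0.9):
        self.x_size = x_size
        self.y_size = y_size
        self.discount = discount
        self.states = [(x, y) for x in range(x_size) for y in range(y_size)]

    def stoch_action(self, a):
        # with probability 0.8 the intended action occurs; otherwise the agent
        # slips to one of the two perpendicular directions (probabilities assumed)
        perpendicular = {UP: (LEFT, RIGHT), DOWN: (LEFT, RIGHT),
                         LEFT: (UP, DOWN), RIGHT: (UP, DOWN)}
        p1, p2 = perpendicular[a]
        return {a: 0.8, p1: 0.1, p2: 0.1}

    def attempt_move(self, s, a):
        # deterministic movement: step in the given direction, staying inside the grid
        moves = {UP: (0, 1), DOWN: (0, -1), LEFT: (-1, 0), RIGHT: (1, 0)}
        dx, dy = moves[a]
        nx, ny = s[0] + dx, s[1] + dy
        if 0 <= nx < self.x_size and 0 <= ny < self.y_size:
            return (nx, ny)
        return s  # bumped into the boundary, so stay put

    # apply_move(self, s, a) as defined above goes here

apply_move(...) then composes these two pieces: it samples a true action from stoch_action(a) and applies it with attempt_move(s, action).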
🧠 Exercise 2 - Q-Learning
Write a function (or section of code) for choosing an action based on stored Q-values (including an exploration strategy).
# Using epsilon-greedy as the exploration strategy.
""" ===== select an action to perform (using epsilon-greedy) ===== """
best_q = -math.inf
best_a = None
# find our exploitation action
for a in ACTIONS:
    if ((self.persistent_state, a) in self.q_values.keys() and
            self.q_values[(self.persistent_state, a)] > best_q):
        # if we've seen this (s, a) pair before and its stored Q-value is better
        # than our best recorded Q-value, update the best-recorded value and action
        best_q = self.q_values[(self.persistent_state, a)]
        best_a = a
# epsilon chance to choose a random action instead
if best_a is None or random.random() < self.EPSILON:
    # exploration
    action = random.choice(ACTIONS)
else:
    # exploitation
    action = best_a
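Written as a policy, the code above is the standard epsilon-greedy rule:

$$\pi(s) = \begin{cases} \text{a uniformly random action from } ACTIONS & \text{with probability } \epsilon \\ \arg\max_{a} Q(s, a) & \text{with probability } 1 - \epsilon \end{cases}$$

(The code also falls back to a random action when no Q-value has been stored yet for the current state.)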
We are essentially trying to use the values gained from the simulations to reconstruct the Q-value function.
In the update formula, we start at our old Q-value and move toward the target.
We need a Q(s, a) function that is 'good enough' to approximate values for (state, action) pairs we haven't seen before.
The Q-value table is a naïve approach to approximating this function, but it is easy to implement.
We want to move toward the target value slowly, because each update is based on a single sampled value.
🧠 In Q-Learning, the target is computed from the current transition (the observed reward plus the discounted estimate of the next state's value), in contrast to Monte Carlo, where the entire episode has to be completed before the target can be computed. However, because each update is based on a single sampled transition and on current (possibly inaccurate) estimates, there is still a significant amount of noise, which is why we move toward the target slowly.
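For reference, the update that next_iteration() below performs is the standard Q-learning rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r + \gamma \max_{a'} Q(s', a')$ is the target.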
def next_iteration(self):
    """ ===== choose an action (this code is taken directly from the block above) ===== """
    best_q = -math.inf
    best_a = None
    for a in ACTIONS:
        if ((self.persistent_state, a) in self.q_values.keys() and
                self.q_values[(self.persistent_state, a)] > best_q):
            best_q = self.q_values[(self.persistent_state, a)]
            best_a = a
    if best_a is None or random.random() < self.EPSILON:  # exploration
        action = random.choice(ACTIONS)
    else:  # exploitation
        action = best_a

    """ ===== simulate result of the action ===== """
    next_state, reward = self.grid.apply_move(self.persistent_state, action)

    """ ===== update the value table ===== """
    best_q1 = -math.inf
    best_a1 = None
    # look at s' and a' to compute the target
    for a1 in ACTIONS:
        if ((next_state, a1) in self.q_values.keys() and
                self.q_values[(next_state, a1)] > best_q1):
            best_q1 = self.q_values[(next_state, a1)]
            best_a1 = a1
    if best_a1 is None or next_state == EXIT_STATE:
        # if best_a1 is None, we haven't initialised any Q-values for next_state yet -
        # like starting all values at 0 in Value Iteration
        best_q1 = 0
    target = reward + (self.grid.discount * best_q1)
    if (self.persistent_state, action) in self.q_values:
        old_q = self.q_values[(self.persistent_state, action)]
    else:
        old_q = 0
    # temporal difference = target - old_q
    # the new Q-value is the old one plus a fraction of the temporal difference
    # (self.ALPHA is the learning rate; the attribute name here is an assumption)
    self.q_values[(self.persistent_state, action)] = old_q + (self.ALPHA * (target - old_q))

    # move to the next state
    self.persistent_state = next_state
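A minimal sketch of how this method might be driven, assuming a QLearningAgent class that owns grid, q_values, persistent_state, EPSILON and ALPHA. The constructor, constant values and training-loop length below are assumptions for illustration.

class QLearningAgent:
    EPSILON = 0.1  # exploration probability (value assumed)
    ALPHA = 0.1    # learning rate (value assumed)

    def __init__(self, grid):
        self.grid = grid
        self.q_values = {}  # maps (state, action) -> learned Q-value
        self.persistent_state = random.choice(grid.states)  # agent's current state

    # next_iteration(self) as defined above goes here


# hypothetical usage: run many single-step iterations, then read off the greedy policy
grid = Grid()
agent = QLearningAgent(grid)
for _ in range(100000):
    agent.next_iteration()

policy = {}
for s in grid.states:
    q_for_s = {a: agent.q_values.get((s, a), 0) for a in ACTIONS}
    policy[s] = max(q_for_s, key=q_for_s.get)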
🧠 Exercise 3a Compare the state value update formula from Monte Carlo vs Temporal Difference (TD) reinforcement learning with a 1-step lookahead, i.e. TD(0)
In both Monte Carlo and Temporal Difference learning, the state value update has the same general form; only the target differs:

$$V(s_t) \leftarrow V(s_t) + \alpha \big[ \text{target} - V(s_t) \big]$$

In Monte Carlo, the target is the full return observed after the episode finishes, $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$, so no update can be made until the episode ends.

In TD(0) learning, we don't need to wait until the end of the episode to update $V(s_t)$, and instead use a one-step lookahead based on current estimates, which is also called bootstrapping. That is, at time $t+1$, the TD method uses the observed reward $r_{t+1}$ and immediately forms the TD target $r_{t+1} + \gamma V(s_{t+1})$, updating $V(s_t)$ with the TD error:

$$V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$$
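As a toy illustration of one TD(0) update (all numbers here are made up), suppose $V(s_t) = 2$, $V(s_{t+1}) = 4$, $r_{t+1} = 1$, $\gamma = 0.9$ and $\alpha = 0.1$. Then:

$$\text{TD target} = 1 + 0.9 \times 4 = 4.6, \qquad V(s_t) \leftarrow 2 + 0.1 \times (4.6 - 2) = 2.26$$

A Monte Carlo update of $V(s_t)$ would instead have to wait for the episode's full return.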
🧠 Exercise 3b Compare the update formula for the state value $V(s)$ in TD learning vs the update formula for $Q(s, a)$ in Q-Learning
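Written out, the two update rules are:

$$V(s) \leftarrow V(s) + \alpha \big[ r + \gamma V(s') - V(s) \big]$$

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$

The structure is identical; Q-Learning values (state, action) pairs rather than states, and replaces $V(s')$ with $\max_{a'} Q(s', a')$, i.e. it bootstraps on the best action available in the next state.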
🧠 Exercise 3c Consider Q-Learning with linear VFA, where the Q-function is represented as a weighted sum of features:

$$Q_w(s, a) = \sum_i w_i f_i(s, a)$$

What does the update function for Q-learning with linear Q-functions become?
In linear VFA, updates to Q(s, a) are replaced by updates to the weights $w$: for each feature $i$,

$$w_i \leftarrow w_i + \alpha \, \delta \, f_i(s, a)$$

where:
$\delta = \big[ r + \gamma \max_{a'} Q_w(s', a') \big] - Q_w(s, a)$ is our temporal difference
$Q_w(s, a) = \sum_i w_i f_i(s, a)$ plays the role of the Q-table, indexed by features instead of by raw (state, action) pairs
Instead of learning the Q-table, we are learning weights which are used to infer what the Q-value for a particular (state, action) pair is.
The Q-value is computed each time from the weighted sum of features, rather than being stored in a table.
This is essentially a single-layer linear neural network.
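A minimal sketch of this update in code. The function names, the features(s, a) helper and its return type are assumptions for illustration; only the update rule itself comes from the notes above.

def linear_q_value(weights, feature_values):
    # Q_w(s, a) = sum_i w_i * f_i(s, a)
    return sum(w * f for w, f in zip(weights, feature_values))

def linear_q_update(weights, features, s, a, reward, next_state, actions,
                    alpha, gamma, terminal=False):
    """One Q-learning update under linear VFA (illustrative sketch).
    'features' is assumed to map (state, action) to a list of feature values."""
    f_sa = features(s, a)
    q_sa = linear_q_value(weights, f_sa)
    # bootstrapped part of the target: best approximated Q-value in the next state
    # (taken as 0 if the next state is terminal)
    max_next_q = 0 if terminal else max(
        linear_q_value(weights, features(next_state, a1)) for a1 in actions)
    delta = (reward + gamma * max_next_q) - q_sa  # temporal difference
    # gradient step: each weight moves in proportion to its feature's value
    return [w + alpha * delta * f for w, f in zip(weights, f_sa)]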
🧠 Exercise 4 Consider a reinforcement learning problem in the game of Pacman with linear VFA and Q-Learning. The actions available to the agent are UP, DOWN, LEFT and RIGHT. Assume we observe an UP action after which Pacman's next state results in being eaten and receiving a reward of -500, and that we are using a linear function of two features to represent Q(s, a):
- A feature representing the proximity to food, denoted here $f_{\text{food}}(s, a)$
- A feature representing the proximity to ghosts, denoted here $f_{\text{ghost}}(s, a)$
Using linear function approximation, update the weights using a learning rate $\alpha$, given the following information.

The temporal difference is $\delta = \big[ r + \gamma \max_{a'} Q_w(s', a') \big] - Q_w(s, a)$, evaluated with the observed reward $r = -500$ (and with $\max_{a'} Q_w(s', a') = 0$ if being eaten ends the episode).

Therefore, the updated weights in the Q-function are

$$w_{\text{food}} \leftarrow w_{\text{food}} + \alpha \, \delta \, f_{\text{food}}(s, a), \qquad w_{\text{ghost}} \leftarrow w_{\text{ghost}} + \alpha \, \delta \, f_{\text{ghost}}(s, a)$$
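As a purely illustrative calculation (the feature values, initial weights and learning rate below are made up, not the numbers given in the exercise): suppose $w_{\text{food}} = 4$, $w_{\text{ghost}} = -1$, $f_{\text{food}}(s, a) = 0.5$, $f_{\text{ghost}}(s, a) = 1.0$, $\alpha = 0.001$, and being eaten ends the episode. Then:

$$Q_w(s, a) = 4 \times 0.5 + (-1) \times 1.0 = 1, \qquad \delta = (-500 + 0) - 1 = -501$$

$$w_{\text{food}} \leftarrow 4 + 0.001 \times (-501) \times 0.5 \approx 3.75, \qquad w_{\text{ghost}} \leftarrow -1 + 0.001 \times (-501) \times 1.0 \approx -1.50$$

The large negative reward pushes both weights down, with the ghost-proximity weight moving the most because its feature value was largest in this transition.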