🧠 Exercise 1 - Environment Representation

def apply_move(self, s, a):
	# handle special cases
	if s in REWARDS.keys(): # is this state terminal - environment-specific.
		# go to the exit state
		next_state = EXIT_STATE
		reward = REWARDS[s]
		return next_state, reward
	elif s == EXIT_STATE: # not required for MDP problems, but useful in this representation
		# go to a new random state - this effectively starts a new episode
		next_state = random.choice(self.states)
		reward = 0
		return next_state, reward

	# choose a random true action
	r = random.random() # between 0 and 1
	cumulative_prob = 0
	action = None
	# choosing stochastic action
	for k, v in self.stoch_action(a).items():
		cumulative_prob += v
		if r < cumulative_prob:
			action = k
			break

	# apply true action
	next_state = self.attempt_move(s, action)
	reward = 0
	return next_state, reward
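
The stochastic transition model itself lives in stoch_action, which is not shown above. Below is a minimal sketch of what it might look like for a grid world where the intended action succeeds with probability 0.8 and slips to either perpendicular direction with probability 0.1 each; the probabilities, the action encoding and the LEFT_OF/RIGHT_OF tables are assumptions, not part of the code above.

UP, DOWN, LEFT, RIGHT = 'U', 'D', 'L', 'R'                 # assumed action encoding
LEFT_OF = {UP: LEFT, LEFT: DOWN, DOWN: RIGHT, RIGHT: UP}   # 90 degrees counter-clockwise
RIGHT_OF = {UP: RIGHT, RIGHT: DOWN, DOWN: LEFT, LEFT: UP}  # 90 degrees clockwise

def stoch_action(self, a):
    # map the intended action to a {true_action: probability} dict;
    # apply_move samples from this via the cumulative-probability loop above
    return {a: 0.8, LEFT_OF[a]: 0.1, RIGHT_OF[a]: 0.1}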

🧠 Exercise 2 - Q-Learning: Write a function (or section of code) for choosing an action based on stored Q-values (including an exploration strategy)

# Using epsilon-greedy as the exploration strategy.
""" ==== select an action to perform (using epsilon-greedy) ==== """
best_q = -math.inf
best_a = None

# find our exploitation action
for a in ACTIONS:
    if ((self.persistent_state, a) in self.q_values.keys() and
            self.q_values[(self.persistent_state, a)] > best_q):
        # we've seen this (s, a) pair before and its stored Q-value beats our
        # best-recorded Q-value, so update the best value and best action
        best_q = self.q_values[(self.persistent_state, a)]
        best_a = a

# epsilon chance to choose a random action instead of the best one found
if best_a is None or random.random() < self.EPSILON:
    # exploration (also taken when no Q-values are stored for this state yet)
    action = random.choice(ACTIONS)
else:
    # exploitation
    action = best_a
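
Since next_iteration below repeats this selection logic, one option (not part of the original code) is to factor it into a small helper on the agent, assuming the same ACTIONS, q_values and EPSILON names and that math and random are imported:

def select_action(self, state):
    # hypothetical helper: epsilon-greedy selection over the stored Q-values
    best_q, best_a = -math.inf, None
    for a in ACTIONS:
        q = self.q_values.get((state, a))
        if q is not None and q > best_q:
            best_q, best_a = q, a
    if best_a is None or random.random() < self.EPSILON:
        return random.choice(ACTIONS)  # explore
    return best_a                      # exploit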

🧠 In Q-Learning, the target is computed from the current transition alone (the immediate reward plus the estimated value of the next state), in contrast to Monte Carlo, where the entire episode has to be completed before the target can be computed. Because the Monte Carlo target is the full sampled return, it carries a significant amount of variance; the bootstrapped Q-Learning target avoids this at the cost of bias.
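
Concretely, the two targets are (writing \gamma for the discount factor):

\text{Q-Learning target:} \quad R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')

\text{Monte Carlo target:} \quad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots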

def next_iteration(self):
    """ === select an action (same epsilon-greedy code as the block above) === """
    best_q = -math.inf
    best_a = None

    for a in ACTIONS:
        if ((self.persistent_state, a) in self.q_values.keys() and
                self.q_values[(self.persistent_state, a)] > best_q):
            best_q = self.q_values[(self.persistent_state, a)]
            best_a = a

    if best_a is None or random.random() < self.EPSILON:  # exploration
        action = random.choice(ACTIONS)
    else:  # exploitation
        action = best_a

    """ ===== simulate result of the action ===== """
    next_state, reward = self.grid.apply_move(self.persistent_state, action)

    """ ===== update the Q-value table ===== """
    # s' and a' -> find max over a' of Q(s', a'), used to compute the target
    best_q1 = -math.inf
    best_a1 = None
    for a1 in ACTIONS:
        if ((next_state, a1) in self.q_values.keys() and
                self.q_values[(next_state, a1)] > best_q1):
            best_q1 = self.q_values[(next_state, a1)]
            best_a1 = a1
    if best_a1 is None or next_state == EXIT_STATE:
        # if best_a1 is None, we haven't initialised any Q-values for s' yet -
        # treat them as 0, like setting all values to 0 in Value Iteration
        best_q1 = 0

    target = reward + (self.grid.discount * best_q1)
    if (self.persistent_state, action) in self.q_values:
        old_q = self.q_values[(self.persistent_state, action)]
    else:
        old_q = 0
    # temporal difference = target - old_q; move old_q a step towards the target,
    # scaled by the learning rate (assumed here to be stored as self.ALPHA)
    self.q_values[(self.persistent_state, action)] = old_q + (self.ALPHA * (target - old_q))

    # move to next state
    self.persistent_state = next_state
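
For context, a minimal sketch of how this might be driven; the Grid and QLearningAgent names, the constructor and the iteration budget are assumptions, not given above.

grid = Grid()                    # the environment from Exercise 1
agent = QLearningAgent(grid)     # hypothetical agent holding q_values, EPSILON, ALPHA
for _ in range(50000):           # assumed training budget
    agent.next_iteration()       # one (s, a, r, s') step and one Q-value update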

Exercise 3

🧠 Exercise 3a Compare the state value V(S_t) update formula from Monte Carlo vs Temporal Difference (TD) reinforcement learning with a 1-step lookahead, i.e. TD(0)

The Monte Carlo and Temporal Difference updates share the same general form; only the target differs.

V(S_t) \leftarrow V(S_t) + \alpha[\text{TARGET} - V(S_t)]

For Monte Carlo, the target is the full return G_t observed at the end of the episode:

V(S_t) \leftarrow V(S_t) + \alpha[G_t - V(S_t)]
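
For TD(0), the target is the immediate reward plus the discounted estimate of the next state's value (bootstrapping), so the standard update is:

V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]

The Monte Carlo target G_t requires the whole episode and has higher variance; the TD(0) target is available after a single step but is biased by the current estimate V(S_{t+1}).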

🧠 Exercise 3b Compare the update formula for the state value V(S_t) in TD learning vs the update formula for Q(s,a) in Q-Learning
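
The TD(0) update adjusts the state value towards a bootstrapped target; Q-Learning applies the same idea to state-action values, with a target that maximises over the next action (standard form):

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)]

This is exactly what next_iteration computes above: target = reward + discount * best_q1, followed by old_q + ALPHA * (target - old_q).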

🧠 Exercise 3c Consider Q-Learning with linear VFA, where:
Q(s,a) = \sum^n_{i=1} w_i f_i(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)
What does the update function for Q-learning with linear Q-Functions become?
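
With a linear Q-function there is no table of Q-values to update; instead each weight is nudged in the direction of the TD error \delta, scaled by its feature value (the standard linear VFA update, matching the \delta notation used in Exercise 4 below):

\delta = r + \gamma \max_{a'} Q(s',a') - Q(s,a)

w_i \leftarrow w_i + \alpha \delta f_i(s,a)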

🧠 Exercise 4 Consider a reinforcement learning problem in the game of Pacman with linear VFA and Q-Learning. The actions available to the agent are UP, DOWN, LEFT and RIGHT. Assume we observe an UP action where Pacman's next state results in being eaten and receiving a reward of -500, and that we are using two features, f_{food} and f_{ghost}, in the linear representation of Q(s, a).
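
Reading the starting weights and feature values off the update lines below (w_{food} = 4.0, w_{ghost} = -1.0, f_{food}(s,a) = 0.5 and f_{ghost}(s,a) = 1.0 for the observed UP action), the current estimate is:

Q(s,a) = 4.0 \times 0.5 + (-1.0) \times 1.0 = 1.0

and, since the next state is terminal, \max_{a'} Q(s',a') = 0.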

\delta = [r + \gamma \max_{a'} Q(s',a')] - Q(s,a) = [-500 + 0] - 1.0 = -501 (the difference / temporal difference)

w_i \leftarrow w_i + \alpha \delta f_i(s,a)

w_{food} \leftarrow 4.0 + \alpha[-501] \times 0.5

w_{ghost} \leftarrow -1.0 + \alpha[-501] \times 1.0

These work out to 3.0 and -3.0 when the learning rate satisfies 501\alpha \approx 2, i.e. \alpha \approx 0.004. Therefore, the updated weights in the Q function are

Q(s,a) = 3.0 \times f_{food}(s,a) - 3.0 \times f_{ghost}(s,a)