🧠 Exercise 1 - Environment Representation

def apply_move(self, s, a):
	# handle special cases
	if s in REWARDS.keys(): # is this state terminal - environment-specific.
		# go to the exit state
		next_state = EXIT_STATE
		reward = REWARDS[s]
		return next_state, reward
	elif s == EXIT_STATE: # not required for MDP problems, but useful in this representation
		# go to a new random state - this effectively starts a new episode
		next_state = random.choice(self.states)
		reward = 0
		return next_state, reward

	# choose a random true action
	r = random.random() # between 0 and 1
	cumulative_prob = 0
	action = None
	# choosing stochastic action
	for k, v in self.stoch_action(a).items():
		cumulative_prob += v
		if r < cumulative_prob:
			action = k
			break

	# apply true action
	next_state = self.attempt_move(s, action)
	reward = 0
	return next_state, reward
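
The stochastic transition model itself lives in stoch_action, which is not shown above. Below is a minimal sketch of what it might look like for a grid world where the intended action succeeds with probability 0.8 and slips to either perpendicular direction with probability 0.1 each; the probabilities, the action encoding and the LEFT_OF/RIGHT_OF tables are assumptions, not part of the code above.

UP, DOWN, LEFT, RIGHT = 'U', 'D', 'L', 'R'                 # assumed action encoding
LEFT_OF = {UP: LEFT, LEFT: DOWN, DOWN: RIGHT, RIGHT: UP}   # 90 degrees counter-clockwise
RIGHT_OF = {UP: RIGHT, RIGHT: DOWN, DOWN: LEFT, LEFT: UP}  # 90 degrees clockwise

def stoch_action(self, a):
    # map the intended action to a {true_action: probability} dict;
    # apply_move samples from this via the cumulative-probability loop above
    return {a: 0.8, LEFT_OF[a]: 0.1, RIGHT_OF[a]: 0.1}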

🧠 Exercise 2 - Q-Learning: Write a function (or section of code) for choosing an action based on stored Q-values (including an exploration strategy)

# Using epsilon-greedy as the exploration strategy.
""" ==== select an action to perform (using epsilon-greedy) ==== """
best_q = -math.inf
best_a = None

# find our exploitation action
for a in ACTIONS:
    if ((self.persistent_state, a) in self.q_values.keys() and
            self.q_values[(self.persistent_state, a)] > best_q):
        # we've seen this (s, a) pair before and its stored Q-value beats our
        # best-recorded Q-value, so update the best value and best action
        best_q = self.q_values[(self.persistent_state, a)]
        best_a = a

# epsilon chance to choose a random action instead of the best one found
if best_a is None or random.random() < self.EPSILON:
    # exploration (also taken when no Q-values are stored for this state yet)
    action = random.choice(ACTIONS)
else:
    # exploitation
    action = best_a
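
Since next_iteration below repeats this selection logic, one option (not part of the original code) is to factor it into a small helper on the agent, assuming the same ACTIONS, q_values and EPSILON names and that math and random are imported:

def select_action(self, state):
    # hypothetical helper: epsilon-greedy selection over the stored Q-values
    best_q, best_a = -math.inf, None
    for a in ACTIONS:
        q = self.q_values.get((state, a))
        if q is not None and q > best_q:
            best_q, best_a = q, a
    if best_a is None or random.random() < self.EPSILON:
        return random.choice(ACTIONS)  # explore
    return best_a                      # exploit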

🧠 In Q-Learning, the target is computed from the current transition alone (the immediate reward plus the estimated value of the next state), in contrast to Monte Carlo, where the entire episode has to be completed before the target can be computed. Because the Monte Carlo target is the full sampled return, it carries a significant amount of variance; the bootstrapped Q-Learning target avoids this at the cost of bias.
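
Concretely, the two targets are (writing \gamma for the discount factor):

\text{Q-Learning target:} \quad R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')

\text{Monte Carlo target:} \quad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots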

def next_iteration(self):
    """ === select an action (same epsilon-greedy code as the block above) === """
    best_q = -math.inf
    best_a = None

    for a in ACTIONS:
        if ((self.persistent_state, a) in self.q_values.keys() and
                self.q_values[(self.persistent_state, a)] > best_q):
            best_q = self.q_values[(self.persistent_state, a)]
            best_a = a

    if best_a is None or random.random() < self.EPSILON:  # exploration
        action = random.choice(ACTIONS)
    else:  # exploitation
        action = best_a

    """ ===== simulate result of the action ===== """
    next_state, reward = self.grid.apply_move(self.persistent_state, action)

    """ ===== update the Q-value table ===== """
    # s' and a' -> find max over a' of Q(s', a'), used to compute the target
    best_q1 = -math.inf
    best_a1 = None
    for a1 in ACTIONS:
        if ((next_state, a1) in self.q_values.keys() and
                self.q_values[(next_state, a1)] > best_q1):
            best_q1 = self.q_values[(next_state, a1)]
            best_a1 = a1
    if best_a1 is None or next_state == EXIT_STATE:
        # if best_a1 is None, we haven't initialised any Q-values for s' yet -
        # treat them as 0, like setting all values to 0 in Value Iteration
        best_q1 = 0

    target = reward + (self.grid.discount * best_q1)
    if (self.persistent_state, action) in self.q_values:
        old_q = self.q_values[(self.persistent_state, action)]
    else:
        old_q = 0
    # temporal difference = target - old_q; move old_q a step towards the target,
    # scaled by the learning rate (assumed here to be stored as self.ALPHA)
    self.q_values[(self.persistent_state, action)] = old_q + (self.ALPHA * (target - old_q))

    # move to next state
    self.persistent_state = next_state
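
For context, a minimal sketch of how this might be driven; the Grid and QLearningAgent names, the constructor and the iteration budget are assumptions, not given above.

grid = Grid()                    # the environment from Exercise 1
agent = QLearningAgent(grid)     # hypothetical agent holding q_values, EPSILON, ALPHA
for _ in range(50000):           # assumed training budget
    agent.next_iteration()       # one (s, a, r, s') step and one Q-value update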

Exercise 3

🧠 Exercise 3a Compare the state value V(S_t) update formula from Monte Carlo vs Temporal Difference (TD) reinforcement learning with a 1-step lookahead, i.e. TD(0)

The Monte Carlo and Temporal Difference updates share the same general form; only the target differs.

V(S_t) \leftarrow V(S_t) + \alpha[\text{TARGET} - V(S_t)]

For Monte Carlo, the target is the full return G_t observed at the end of the episode:

V(S_t) \leftarrow V(S_t) + \alpha[G_t - V(S_t)]
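
For TD(0), the target is the immediate reward plus the discounted estimate of the next state's value (bootstrapping), so the standard update is:

V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]

The Monte Carlo target G_t requires the whole episode and has higher variance; the TD(0) target is available after a single step but is biased by the current estimate V(S_{t+1}).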

🧠 Exercise 3b Compare the update formula for the state value V(S_t) in TD learning vs the update formula for Q(s,a) in Q-Learning
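
The TD(0) update adjusts the state value towards a bootstrapped target; Q-Learning applies the same idea to state-action values, with a target that maximises over the next action (standard form):

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)]

This is exactly what next_iteration computes above: target = reward + discount * best_q1, followed by old_q + ALPHA * (target - old_q).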

🧠 Exercise 3c Consider Q-Learning with linear VFA, where:
Q(s,a) = \sum^n_{i=1} w_i f_i(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)
What does the update function for Q-learning with linear Q-Functions become?
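
With a linear Q-function there is no table of Q-values to update; instead each weight is nudged in the direction of the TD error \delta, scaled by its feature value (the standard linear VFA update, matching the \delta notation used in Exercise 4 below):

\delta = r + \gamma \max_{a'} Q(s',a') - Q(s,a)

w_i \leftarrow w_i + \alpha \delta f_i(s,a)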

🧠 Exercise 4 Consider a reinforcement learning problem in the game of Pacman with linear VFA and Q-Learning. The actions available to the agent are UP, DOWN, LEFT and RIGHT. Assume we observe an UP action where Pacman's next state results in being eaten and receiving a reward of -500, and that we are using two features, f_{food} and f_{ghost}, in the linear representation of Q(s, a).
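
Reading the starting weights and feature values off the update lines below (w_{food} = 4.0, w_{ghost} = -1.0, f_{food}(s,a) = 0.5 and f_{ghost}(s,a) = 1.0 for the observed UP action), the current estimate is:

Q(s,a) = 4.0 \times 0.5 + (-1.0) \times 1.0 = 1.0

and, since the next state is terminal, \max_{a'} Q(s',a') = 0.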

\delta = [r + \gamma \max_{a'} Q(s',a')] - Q(s,a) = [-500 + 0] - 1.0 = -501 (the difference / temporal difference)

w_i \leftarrow w_i + \alpha \delta f_i(s,a)

w_{food} \leftarrow 4.0 + \alpha[-501] \times 0.5

w_{ghost} \leftarrow -1.0 + \alpha[-501] \times 1.0

These work out to 3.0 and -3.0 when the learning rate satisfies 501\alpha \approx 2, i.e. \alpha \approx 0.004. Therefore, the updated weights in the Q function are

Q(s,a) = 3.0 \times f_{food}(s,a) - 3.0 \times f_{ghost}(s,a)