🧠Deep Q-Networks (DQNs) approximate the action-value function of Q-Learning using a neural network. In the case of Atari games (like Breakout, shown in the lectures), the network may take in several frames of the game as input and output a Q-value for each action. More generally, the NN learns to transform a state into an estimated Q-value for each possible action (i.e. Q(s, a)).
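As a concrete illustration, here is a minimal sketch of such a network, assuming PyTorch; the class name `QNetwork`, the layer sizes, and the hidden width are illustrative assumptions, not part of the exercise.

```python
# Minimal sketch of the DQN idea, assuming PyTorch; class name, layer sizes
# and hidden width are illustrative assumptions.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one estimated Q-value per action, i.e. Q(s, ·)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 4-dimensional state and 2 discrete actions (as in CartPole below)
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)               # a batch containing one state
q_values = q_net(state)                 # shape (1, 2): Q(s, a) for each action
greedy_action = q_values.argmax(dim=1)  # pick the action with the highest Q-value
```

Note that the number of outputs equals the number of discrete actions, which is what Exercise 1B below asks you to think about.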
🧠Exercise 1A Consider the CartPole environment of the OpenAI Gym, where the objective is to move a cart left or right in order to balance an upright pole for as long as possible. The Reinforcement Learning states, actions and rewards can be formalised as follows:
- The state is specified by four parameters $(x, \dot{x}, \theta, \dot{\theta})$, where:
  - $x$: the horizontal position of the cart (positive == right)
  - $\dot{x}$: the horizontal velocity of the cart (positive == moving to the right)
  - $\theta$: the angle between the pole and the vertical position (positive == clockwise)
  - $\dot{\theta}$: the angular velocity of the pole (positive == rotating clockwise)
- The actions that the agent can perform are:
  - 0: Push the cart to the left
  - 1: Push the cart to the right
- The game terminates when the pole deviates more than 15 degrees from vertical ($|\theta| > 15^\circ$). In each time step, if the game is not done, the cumulative reward increases by 1. The goal of the game is to accumulate the highest cumulative reward.

Explain why standard Q-Learning using a table of state-action values cannot be used in this environment.
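Before answering, it can help to inspect the environment's state and action spaces directly. The sketch below assumes the `gymnasium` package (the maintained fork of OpenAI Gym); with older `gym` versions the `reset()`/`step()` return signatures differ slightly.

```python
# A quick look at CartPole's state and action spaces; a sketch assuming the
# `gymnasium` package (the maintained fork of OpenAI Gym).
import gymnasium as gym

env = gym.make("CartPole-v1")

# The observation is a 4-dimensional vector of real numbers
# (cart position, cart velocity, pole angle, pole angular velocity).
print(env.observation_space)   # Box(4,) with real-valued bounds

# The action space is discrete: 0 = push left, 1 = push right.
print(env.action_space)        # Discrete(2)

state, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy, for illustration only
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # +1 per time step while the pole stays up
    done = terminated or truncated
print(f"Episode return: {total_reward}")
```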
🧠Exercise 1B Consider that the CartPole is now controlled by an analog joystick: instead of only being able to push the cart left or right, you may push it faster to the left or faster to the right (the actions are now continuous). What is a limitation of the output format of Deep Q-Networks for this problem? What alternative algorithm could provide a solution?