🧠Deep Q-Networks (DQNs) approximate the action-value function of Q-Learning using a neural network. In the case of Atari games (like Breakout, shown in the lectures), the network may take in several frames of the game as input and output a Q-value for each action. More generally, the NN learns to transform a state into an estimated Q-value for each possible action (i.e. Q(s, a)).
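As a concrete illustration, here is a minimal sketch of such a network, assuming PyTorch; the class name `QNetwork`, the layer sizes, and the hidden width are illustrative assumptions, not part of the exercise.

```python
# Minimal sketch of the DQN idea, assuming PyTorch; class name, layer sizes
# and hidden width are illustrative assumptions.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one estimated Q-value per action, i.e. Q(s, ·)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 4-dimensional state and 2 discrete actions (as in CartPole below)
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)               # a batch containing one state
q_values = q_net(state)                 # shape (1, 2): Q(s, a) for each action
greedy_action = q_values.argmax(dim=1)  # pick the action with the highest Q-value
```

Note that the number of outputs equals the number of discrete actions, which is what Exercise 1B below asks you to think about.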
🧠Exercise 1A Consider the CartPole environment of the OpenAI Gym, where the objective is to move a cart left or right in order to balance an upright pole for as long as possible. The Reinforcement Learning states, actions and rewards can be formalised as follows:
- The state is specified by four parameters $(x, \dot{x}, \theta, \dot{\theta})$, where:
  - $x$: the horizontal position of the cart (positive == right)
  - $\dot{x}$: the horizontal velocity of the cart (positive == moving to the right)
  - $\theta$: the angle between the pole and the vertical position (positive == clockwise)
  - $\dot{\theta}$: the angular velocity of the pole (positive == rotating clockwise)
- The actions that the agent can perform are:
  - 0: Push the cart to the left
  - 1: Push the cart to the right
- The game terminates when the pole deviates more than 15 degrees from vertical ($|\theta| > 15^\circ$). In each time step, if the game is not done, the cumulative reward increases by 1. The goal of the game is to accumulate the highest cumulative reward.

Explain why standard Q-Learning using a table of state-action values cannot be used in this environment.
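Before answering, it can help to inspect the environment's state and action spaces directly. The sketch below assumes the `gymnasium` package (the maintained fork of OpenAI Gym); with older `gym` versions the `reset()`/`step()` return signatures differ slightly.

```python
# A quick look at CartPole's state and action spaces; a sketch assuming the
# `gymnasium` package (the maintained fork of OpenAI Gym).
import gymnasium as gym

env = gym.make("CartPole-v1")

# The observation is a 4-dimensional vector of real numbers
# (cart position, cart velocity, pole angle, pole angular velocity).
print(env.observation_space)   # Box(4,) with real-valued bounds

# The action space is discrete: 0 = push left, 1 = push right.
print(env.action_space)        # Discrete(2)

state, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy, for illustration only
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # +1 per time step while the pole stays up
    done = terminated or truncated
print(f"Episode return: {total_reward}")
```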
🧠Exercise 1B Consider that the CartPole is now controlled by an analog joystick: instead of only being able to push the cart left or right, you may push it faster to the left or faster to the right (the actions are now continuous). What is a limitation of the output format of Deep Q-Networks for this problem? What alternative algorithm could provide a solution?