Computer-based models learn from data, identify patterns and make decisions with minimal human intervention
Machines that Learn
Incredible success in many applications
Image recognition
Image segmentation (labelling pixels into different classes / groups)
Style transfer
AlphaGo
Breakout (Atari)
1.1 - Machine Learning Applications
Image recognition
Speech recognition
Traffic prediction
Recommender systems
Self-driving cars
Email spam filtering
Social media (Facebook, Twitter, etc) - recommender systems / advertising
Games: Chess, Atari, Go, DoTA, Poker
Medical/Health: Radiology image interpretation, predicting epidemic outbreaks etc.
Masked face recognition
2.0 - What is Deep Learning?
Artificial Intelligence is the umbrella term for anything that seems intelligent
Machine Learning - computer or machine learns from data
Deep Learning - A subset of Machine Learning that uses neural networks to learn feature representations directly from the data.
Neurons connected by edges
Weights and features
2.1 - Image Recognition
Formerly called Artificial Neural Networks (2012)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
1.2M images in 1K categories
Classification: Make 5 guesses about the image label
Huge jump in performance thanks to Deep learning in 2012.
Performance increases as the size of the neural net increases.
2.2 - Simple Neural Networks (Perceptron)
Weights determine the importance of each feature
The weighted sum of inputs (weights×inputs), passed through a non-linear function, gives the output
Can be mathematically represented as:
ŷ = g(θ₀ + ∑_{i=1}^{m} x_i θ_i)
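A minimal sketch of the perceptron computation above in Python/numpy; the sigmoid is used here as the non-linearity g, which is one common choice rather than the only one, and the weights are illustrative values.

```python
import numpy as np

def perceptron(x, theta, theta0):
    """Single perceptron: weighted sum of inputs plus bias, passed through
    a non-linear activation g (sigmoid here, one common choice)."""
    z = theta0 + np.dot(x, theta)      # theta0 + sum_i x_i * theta_i
    return 1.0 / (1.0 + np.exp(-z))    # g(z) = sigmoid(z)

# Example: 3 features, hand-picked weights (illustrative values only)
x = np.array([1.0, -2.0, 0.5])
theta = np.array([0.3, 0.1, -0.4])
print(perceptron(x, theta, theta0=0.2))
```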
2.3 - Building Blocks of Neural Networks
A fully connected neural network is a series of nodes where a node is connected to every single node in the next layer.
This is an example of a 3-layer NN (We don't count the input layer)
If we have a distinct weight for every neuron-to-neuron connection, the number of weights grows very quickly with the size of the network.
Weight sharing (reusing the same small set of weights across many connections, as in convolutional layers) reduces the total number of weights.
Can use pooling to downsample the input features, or outputs of each layer in the NN
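A small sketch (PyTorch, layer sizes chosen only for illustration) contrasting a fully connected layer, a weight-shared convolutional layer, and a pooling layer, to show how weight sharing and pooling cut down the number of weights.

```python
import torch
import torch.nn as nn

# Fully connected: a distinct weight for every input-output connection.
fc = nn.Linear(84 * 84, 256)                      # 84*84*256 ≈ 1.8M weights

# Weight sharing (convolution): one small filter reused across the whole image.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5)  # 16*1*5*5 = 400 weights

# Pooling: downsample the feature maps (no weights at all).
pool = nn.MaxPool2d(kernel_size=2)

x = torch.randn(1, 1, 84, 84)                     # a dummy 84x84 "image"
print(sum(p.numel() for p in fc.parameters()))    # ~1.8M
print(sum(p.numel() for p in conv.parameters()))  # 416 (400 weights + 16 biases)
print(pool(conv(x)).shape)                        # torch.Size([1, 16, 40, 40])
```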
2.4 - Deep NNs
Residual network: adds skip connections between layers; usually illustrated by comparing VGG-19, a plain deep network, and the residual version of that network.
Each box (layer) in the diagram has a number associated with it (the number of filters/units in that layer).
A DNN such as this has on the order of 1M weights to learn.
2.5 - Training Neural Networks
Learn the parameters/weights by optimising an objective function:
Minimise the error between the model's predictions f(x) and the true labels y
Performed through backpropagation and gradient descent
We feed in inputs (as represented by the matrix x)
These inputs get multiplied by the weights {w1,w2,w3,w4,w5,w6} to determine the values at the nodes z1,z2,z3
The values at the nodes z1,z2,z3 are then multiplied by the weights {w7,w8,w9} to determine the final value y^1
Through all of this, we are trying to approximate a function f(x)
The amount of error informs the DNN and drives an update of the weights.
The objective function / cost function / empirical risk is a measure of the difference between what our network predicted vs the actual observed value.
J(θ) = (1/n) ∑_{i=1}^{n} L(f(x^(i); θ), y^(i)), where f(x^(i); θ) is the predicted value and y^(i) is the actual value
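A minimal numpy sketch of the empirical risk above, assuming a squared-error loss L (other losses, e.g. cross-entropy, slot into the same template).

```python
import numpy as np

def empirical_risk(f, theta, xs, ys):
    """J(theta) = (1/n) * sum_i L(f(x_i; theta), y_i), with squared-error loss."""
    predictions = np.array([f(x, theta) for x in xs])
    losses = (predictions - ys) ** 2          # L(predicted value, actual value)
    return losses.mean()                      # average over the n examples

# Toy example: a linear model f(x; theta) = theta . x
f = lambda x, theta: np.dot(theta, x)
xs = np.array([[1.0, 2.0], [3.0, 4.0]])
ys = np.array([5.0, 6.0])
print(empirical_risk(f, np.array([1.0, 1.0]), xs, ys))  # ((3-5)^2 + (7-6)^2)/2 = 2.5
```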
2.5.1 - Optimisation / Function Approximation
Suppose we collect the weights and bias into a single parameter vector θ = [w, b].
We then compute the gradient using partial derivatives: ∂J(θ)/∂θ
Then we update the weights: θ ← θ − α·∂J(θ)/∂θ
Non-linear function approximation is achieved by using a non-linear activation function
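A minimal sketch of the update rule above. The gradient is estimated by finite differences purely for illustration; backpropagation computes the same quantity analytically. The toy objective J is made up.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    """Finite-difference estimate of dJ/dtheta (backprop computes this analytically)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

def gradient_descent(J, theta, alpha=0.1, iterations=100):
    for _ in range(iterations):
        theta = theta - alpha * numerical_gradient(J, theta)  # theta <- theta - alpha * dJ/dtheta
    return theta

# Toy objective with its minimum at theta = [3, -1]
J = lambda t: (t[0] - 3) ** 2 + (t[1] + 1) ** 2
print(gradient_descent(J, np.array([0.0, 0.0])))  # converges towards [3, -1]
```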
2.6 - Types of Machine Learning
Supervised Learning
Input: Data with labels {x,y}
Output: Learn mapping function x→y
Target used to determine error which is fed back into the supervised learning algorithm
Unsupervised learning
Input: Data without labels {x}
Output: Learn the underlying structure (e.g. classes)
Reinforcement Learning
Input: State-Action Pairs {s,a}
Output: Action that maximises expected future rewards
Actions are evaluated via the rewards encountered, which are used as feedback to the RL algorithm
2.0 - Reinforcement Learning
Gather data as we are learning
Which data to use affects the learning performance
The agent has control over which data it gathers (it does so in the form of episodes)
Episode - Sequence of states from starting state to terminal state.
Agent performs an action A_t and observes feedback (in this case, in the form of the next state and reward received).
Agent is able to modify the action that it chooses to maximise the reward.
Since the reward is unknown, the agent must explore the environment.
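A minimal sketch of this interaction loop for one episode; `env` and `agent` are hypothetical objects, and the reset/step/select_action/update interface is an assumption rather than any particular library's API.

```python
def run_episode(env, agent):
    """One episode: a sequence of states from the starting state to a terminal state.
    `env` and `agent` are hypothetical objects; their interface is an assumption."""
    state = env.reset()                                   # starting state
    episode = []
    done = False
    while not done:
        action = agent.select_action(state)               # agent chooses A_t (may explore)
        next_state, reward, done = env.step(action)       # feedback: next state and reward
        episode.append((state, action, reward))
        agent.update(state, action, reward, next_state)   # learn from the feedback
        state = next_state
    return episode
```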
2.1 - Deep RL
A combination of Deep Learning + Reinforcement Learning
Deep Learning: Feature representation learning
Reinforcement Learning: Exploration / interaction in an environment.
Earlier layers in the network learn more primitive features (such as lines, edges)
Deeper layers in the network learn more complex features (e.g. hexagonal structure, beak in the bottom left corner)
Deep RL - deep networks serve as the function approximators and feature representation learners
2.2 - Reinforcement Learning Recap
The value of a state (or action) is the expected sum of discounted future rewards
The return at time step t is the sum of expected rewards in future time steps, discounted by γ
The ultimate goal of our agent is to maximise V (i.e. the expected value of successive rewards).
The equations V(s) = R(s) + γ·max_a Q(s, a) and Q(s, a) = ∑_{s'} P(s' | s, a)·V(s') are Bellman equations
The relationship between V(s) and Q(s, a) is that the Q-value is defined for a (state, action) pair, whereas the value is defined for a single state
The Q-function captures the expected total future reward an agent in state s can receive by executing a certain action a
We want to maximise the expected reward for a given state, and action
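A tiny worked example of the two Bellman relationships above, on a made-up MDP where the successor values are assumed known:

```python
# Tiny, made-up MDP to illustrate the two Bellman relationships above.
gamma = 0.9
R = {"s0": 1.0}                                  # reward depends only on the state here
P = {("s0", "left"):  {"s1": 1.0},               # P(s' | s, a)
     ("s0", "right"): {"s1": 0.5, "s2": 0.5}}
V = {"s1": 10.0, "s2": 2.0}                      # assume successor values are known

# Q(s,a) = sum_{s'} P(s'|s,a) * V(s')
Q = {(s, a): sum(p * V[s2] for s2, p in succ.items()) for (s, a), succ in P.items()}
print(Q)                                         # {('s0','left'): 10.0, ('s0','right'): 6.0}

# V(s) = R(s) + gamma * max_a Q(s,a)
V_s0 = R["s0"] + gamma * max(Q[("s0", a)] for a in ["left", "right"])
print(V_s0)                                      # 1.0 + 0.9 * 10.0 = 10.0
```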
2.2.1 - Policy
Once we know the Q-values, extracting the optimal policy is easy - pick the action with the highest value
Note - state-values already incorporate the sum of discounted future rewards.
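A minimal sketch of extracting the greedy policy from a Q-table; the Q-values shown are illustrative.

```python
def greedy_policy(Q, state, actions):
    """Extract the optimal action for `state` from a Q-table: pick the action with the
    highest Q-value (Q-values already incorporate the discounted future rewards)."""
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("s0", "left"): 10.0, ("s0", "right"): 6.0}   # illustrative values
print(greedy_policy(Q, "s0", ["left", "right"]))   # 'left'
```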
2.3 - Deep RL Algorithms
Deep RL algorithms can be broadly classified into two types of algorithms
Value Learning
Find Q(s,a)
a = argmax_a Q(s, a)
Estimate the values of each (s, a) pair
After the estimates have converged, find action
Action found by finding a with largest value for fixed state.
Policy Learning
Find π(s)
Sample a∼π(s)
Find policy - sample actions from probabilistic policy.
2.4 - Value Learning RL
Actions:
move_paddle_left
move_paddle_right
do_not_move_paddle
Rewards:
If ball hits brick, reward = 1
Otherwise, reward = 0
End condition
If ball falls off the screen, the game ends.
Can we learn to control an agent directly from sensor inputs?
In breakout, sensory input would be a game screen frame
Finding a state representation - do we know the direction of the ball? the velocity of the ball?
2.4.1 - Q-Value Approximation
Use a function to approximate the Q-function: Q(s, a; θ) ≈ Q*(s, a)
Which of these state-actions is better? A or B?
Which has a higher Q(s,a)?
Hand-designing features w(s,a)
Performance depends on the quality of features
Not generalized
Doesn't scale well with game complexity
Handcrafting features is very difficult!
We want to learn features from pixels!
We can stack consecutive frames for each state to determine velocity and direction
However:
Assuming that there are 84×84 pixels per frame, each pixel can take on 256 values, and there are 4 stacked frames per state
There are approximately 256^(84×84×4) states
For comparison, the number of atoms in the universe is on the order of 10^78 to 10^82
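A quick back-of-the-envelope check of that state count, done in log-space since the number itself is astronomically large:

```python
import math

pixels_per_state = 84 * 84 * 4          # 4 stacked 84x84 frames
values_per_pixel = 256

# Number of distinct states = 256^(84*84*4); work in log10 to avoid a huge integer.
log10_states = pixels_per_state * math.log10(values_per_pixel)
print(f"~10^{log10_states:.0f} possible states")   # ~10^67970
print("atoms in the universe: ~10^78 to 10^82")    # a tabular Q-function is hopeless here
```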
2.4.2 - DNNs as a Q-Value Approximator
To solve this, we can use a DNN as a Q(s,a) approximator
Input: State s, action a
Output: The expected return: Q(s, a)
Can we make this even more efficient?
Yes, we can set the state as the only input into the NN
NN outputs a Q-value for every possible action
Select action corresponding to the highest Q-value
A single network to predict Q(s, a) for all possible states and actions
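A PyTorch sketch of such a Q-network: the state (4 stacked 84×84 frames) is the only input and the output is one Q-value per action. The layer sizes follow the commonly cited Atari DQN architecture, but treat them as illustrative assumptions rather than the definitive model.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State in, one Q-value per action out (layer sizes follow the classic
    Atari DQN; treat them as an illustrative assumption)."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # one output per possible action
        )

    def forward(self, state):
        return self.net(state)                  # shape: (batch, n_actions)

q_net = QNetwork(n_actions=3)                   # Breakout: left, right, no-op
state = torch.zeros(1, 4, 84, 84)               # 4 stacked 84x84 frames
q_values = q_net(state)
action = q_values.argmax(dim=1)                 # select the action with the highest Q-value
```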
2.4.3 - Training DQNs
Training Deep-Q-Networks can be done using the Bellman Equations
Q(s, a) = r + γ·max_{a'} Q(s', a')
We also need a loss function to determine how well our network is performing
L = E[(r + γ·max_{a'} Q(s', a') − Q(s, a))²]
r + γ·max_{a'} Q(s', a') is the target
Q(s, a) is the predicted value
Use the neural network to learn the Q-function and use it to infer the optimal policy.
Take an experience ⟨s, a, r, s′⟩ and update the network as follows:
Do a feedforward pass for the current state s to get predicted Q-values for all actions
Do a feedforward pass for the next state s′ and calculate the maximum over all network outputs, max_{a'} Q(s′, a′)
Set the Q-value target for action a to r + γ·max_{a'} Q(s′, a′) (keep the predicted values for the other actions)
Update the weights of the neural network using backpropagation
🧠 This is very similar to using TD target to make Q-Learning estimates closer to the actual value.
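A minimal sketch of one such update for a single experience ⟨s, a, r, s′⟩, reusing the `QNetwork` sketch from the previous section. No replay buffer or target network yet; those tricks are covered next. The learning rate and discount factor are illustrative.

```python
import torch
import torch.nn.functional as F

gamma = 0.99
optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-4)   # q_net: QNetwork sketch above

def dqn_update(s, a, r, s_next, done):
    """One update from an experience <s, a, r, s'>.
    s, s_next: tensors of shape (1, 4, 84, 84); a: int; r: float; done: 0.0 or 1.0."""
    q_pred = q_net(s)[0, a]                           # feedforward pass for s: Q(s, a)
    with torch.no_grad():                             # feedforward pass for s'
        q_next_max = q_net(s_next).max(dim=1).values[0]
        target = r + gamma * q_next_max * (1.0 - done)   # r + gamma * max_a' Q(s', a')
    loss = F.mse_loss(q_pred, target)                 # L = (target - Q(s, a))^2
    optimiser.zero_grad()
    loss.backward()                                   # backpropagation updates the weights
    optimiser.step()
    return loss.item()
```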
When training DQN naively, it doesn't always work that well
There are a few tricks that we can use to better train the weights
Experience Replay (store transitions ⟨s, a, r, s′⟩ in a buffer and sample random mini-batches from it when updating the network)
Fixed Target Network
Reward Clipping (avoid extremes)
Skipping Frames
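A minimal sketch of the first two tricks: an experience replay buffer, plus a comment on where a fixed target network would plug in. The buffer capacity and sync interval are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions <s, a, r, s', done> and train on random
    mini-batches instead of only the most recent (highly correlated) experience."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Fixed target network: a periodically-synced copy of the Q-network used to compute
# the target r + gamma * max_a' Q_target(s', a'), so the target does not shift with
# every single weight update (sync e.g. every few thousand steps):
#   target_net.load_state_dict(q_net.state_dict())
# Reward clipping: clip rewards to [-1, 1] to avoid extreme targets.
```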
2.4.4 - DQN for Atari
A single algorithm that learns from pixels to extract a policy
Comparison of the DQN agent with the best RL methods in literature
The performance of the DQN is normalised with respect to a professional human games tester (that is, 100% level) and random play (that is 0% level)
DQN Doesn't Always Work Well
In Atari Pinball, DQN does very well - 2539% of human performance
In Atari Time Pilot, DQN is approximately at the level of human performance
In Atari Montezuma's Revenge, DQN does much worse than human performance.
Rewards are very sparse - the signals given to the algorithm don't occur very often.
Also rewards are very far away in the future.
2.5 - Downsides of Q-Learning
Does not work well if environment has sparse, delayed rewards
Does not work well in continuous action spaces - the network outputs a Q-value for every action, which cannot be done for a continuous action space
Policy is deterministically computed from the Q-function by maximising the Q-value → cannot learn stochastic policies.
2.6 - Policy Gradient
DQN (off-policy): Approximates Q and infers optimal policy
PG (on-policy): Directly optimises policy, π(s)
Note here that we have a single-layer NN
Policy Gradient directly optimises the policy π(s)
Enables modelling of continuous action spaces.
State as an input into the NN
NN learns weights and outputs a probability for each action.
Policy is now a probability distribution of what action we should perform conditioned on the current state.
Sample an action from the output probability distribution (rather than always taking the highest-probability action).
Run the policy for a while. See what actions led to high rewards and increase their probability.
This will handle the case where the rewards are sparse or far away.
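A PyTorch sketch of such a policy network: state in, a probability for each action out, and the action sampled from that distribution. The state dimension, action count, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a | s): state in, a probability for each action out (illustrative sizes)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)   # probabilities sum to 1

policy = PolicyNetwork(state_dim=4, n_actions=2)
state = torch.zeros(1, 4)
probs = policy(state)                                   # e.g. tensor([[0.52, 0.48]])
dist = torch.distributions.Categorical(probs)
action = dist.sample()                                  # sample, don't just take the argmax
```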
2.6.1 - Training Policy Gradient
Initialise the agent
Run a policy until termination
Record all ⟨s,a,r⟩ (states, actions and rewards)
Decrease probability of actions that resulted in low reward
Increase the probability of actions that resulted in a high reward.
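A minimal REINFORCE-style sketch of that loop for one recorded episode, reusing the `policy` sketch from the previous section. Scaling each action's log-probability by the discounted return that followed it is what pushes low-reward actions down and high-reward actions up; the learning rate and discount are illustrative.

```python
import torch

gamma = 0.99
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)  # policy: sketch above

def policy_gradient_update(states, actions, rewards):
    """One update from a recorded episode: states is a list of state tensors,
    actions a list of ints, rewards a list of floats (the recorded <s, a, r>)."""
    # Discounted return G_t for every time step, computed backwards from the end.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Increase log pi(a_t | s_t) in proportion to the return that followed it:
    # actions followed by low reward get pushed down, high reward pushed up.
    probs = policy(torch.stack(states))
    log_probs = torch.distributions.Categorical(probs).log_prob(torch.tensor(actions))
    loss = -(log_probs * returns).mean()

    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```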
2.7 - Summary
Explain the explore-exploit dilemma and solutions to multi-armed bandit problems
Explain the relationship between decision-theoretic planning (MDPs) and reinforcement learning
Reinforcement Learning is an MDP with more unknowns (don't know transition function or reward model)
Explain the difference between model-based and model-free RL
Model-Based - Trying to learn the model
Model-Free - Estimating values for each state directly, Q-Learning and SARSA
Explain the difference between on-policy and off-policy RL
I.e. Q-Learning vs SARSA
Off-Policy (Q-Learning) - learns about the greedy/optimal policy while following a different, exploratory behaviour policy
On-Policy (SARSA) - learns about the policy it is actually following, using the action that policy chooses next
Implement and trace basic table-based RL algorithms - Q-Learning and SARSA
Given a table of every (s, a) pair, estimate the value of each pair (a minimal update sketch follows at the end of this summary)
Understand the limitations of state-based reinforcement learning and the use of feature-based RL algorithms including Deep RL
The number of (s, a) pairs grows exponentially with the number of state variables - when should we use feature-based RL instead?
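A minimal sketch of the table-based updates referred to above, with a dictionary as the Q-table; α and γ are illustrative.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)                 # Q-table: one entry per (state, action) pair

def q_learning_update(s, a, r, s_next, actions):
    """Off-policy: bootstrap from the best next action, whatever we actually do next."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action the current policy actually chose next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```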