Update the value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$
$V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$
$R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
TD(0) looks one step in advance, but TD(λ) looks multiple steps in advance.
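A minimal sketch of the one-step TD(0) update in Python, assuming a dictionary-based value table (the state names, reward and step sizes below are only for illustration):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the value table V (a dict keyed by state)."""
    td_target = r + gamma * V[s_next]   # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[s]         # delta_t
    V[s] += alpha * td_error            # V(S_t) <- V(S_t) + alpha * delta_t
    return td_error

# V is a defaultdict, so unvisited states start with value 0
V = defaultdict(float)
td0_update(V, s="A", r=1.0, s_next="B")
```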
1.3.1 - Sutton & Barto - Monte Carlo vs Temporal Difference Learning
The following figure is taken from p. 146 of Sutton & Barto, Reinforcement Learning: An Introduction.
Left: changes recommended by Monte Carlo methods (based on actual outcomes). Right: changes recommended by Temporal Difference methods (based on observations of successive states).
Note that for Monte Carlo methods to update, we have to wait until the episode is complete.
However, in TD methods we can update after every action is performed.
Monte Carlo methods update each state with respect to the actual outcome (the return actually observed)
TD methods update each state's estimate toward the observed reward plus the estimated value of the next state
That is, in these examples the dots are the estimates and the dotted lines represent the actual values.
1.3.2 - Multi-Step TD Learning (TD(λ))
We can actually use predictions that look more than one step ahead
"n-step TD"
TD(λ) is another approach, which combines (geometrically weights) the n-step returns for all n.
If we perform ∞-step TD, it is the same as Monte Carlo
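A sketch of the n-step TD target that this generalisation uses, assuming a dict-based value table V (the rewards and state below are placeholders):

```python
def n_step_td_target(rewards, V, s_after_n, gamma=0.99):
    """n-step TD target: R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}).

    `rewards` holds the n observed rewards R_{t+1}, ..., R_{t+n};
    `s_after_n` is the state S_{t+n} whose estimated value we bootstrap from.
    """
    n = len(rewards)
    discounted_rewards = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted_rewards + gamma**n * V.get(s_after_n, 0.0)

# With n = 1 this is the TD(0) target; as n grows and the episode ends
# (so the bootstrap term vanishes), it approaches the Monte Carlo return.
target = n_step_td_target([1.0, 0.0, 2.0], {"X": 0.5}, "X")
```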
1.4 - Q-Learning
With known reward and state-transition functions:
$Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\big(R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\big)$
Idea: store Q(state, action) in a table and update it as in asynchronous value iteration, but using experience (empirical probabilities and rewards).
Suppose the agent has an experience (s,a,r,s′)
This provides one piece of data to update Q(s,a)
An experience (s,a,r,s′) provides a new estimate for the value of Q∗(s,a)
$\text{TD target} = r + \gamma \max_{a'} Q(s', a')$
$\hat{Q}^*(s,a) = r + \gamma \max_{a'} Q(s', a') = \text{reward} + \text{discounted estimate of the best future value}$
Note: We potentially know nothing about the environment, but we can initialise the Q-values to all zeros.
This can be used in the Temporal Differences formula:
$\hat{Q}(s,a) \leftarrow \hat{Q}(s,a) + \alpha\big(r + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s,a)\big)$
Therefore, based on empirical values, we can update our Q value.
Iteratively estimate the table $\hat{Q}(s,a)$ from experience:
Initialise $\hat{Q}(s,a)$ arbitrarily (e.g. all zeros)
Observe the current state s
Repeat for each step of each episode, until convergence:   {learn the Q-values}
    Select and carry out an action a
    Observe reward r and next state s'
    $\hat{Q}(s,a) \leftarrow \hat{Q}(s,a) + \alpha\big(r + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s,a)\big)$
    s ← s'
For each state s:   {extract the policy - find the action that maximises $\hat{Q}(s,a)$}
    π(s) ← argmax_a $\hat{Q}(s,a)$
During learning, select actions using the exploration strategies from the Multi-Armed Bandits discussed earlier (e.g. ε-greedy).
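A minimal Python sketch of this loop, assuming a gym-style environment where `env.reset()` returns the initial state and `env.step(a)` returns `(s', r, done)`; this interface is an assumption, not part of the notes:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning; assumes env.reset() -> s and env.step(a) -> (s', r, done)."""
    Q = defaultdict(float)                                   # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    # extract the greedy policy for the states that were visited
    policy = {}
    for (state, _) in list(Q.keys()):
        policy[state] = max(actions, key=lambda x: Q[(state, x)])
    return Q, policy
```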
1.4.3 - Problems with Q-Learning
It does only one backup per experience - is this appropriate for a robot interacting with the real world?
An agent might be able to make better use of the data by doing multi-step backups or building a model and using MDP methods to determine the optimal policy
Perform multi-step backups as in TD
It learns separately for each state; it might be able to learn better over collections of states (feature-based RL)
1.5 - SARSA
🧠 Incorporate the exploration strategy with SARSA
1.5.1 - On-Policy Learning
Q-Learning does off-policy learning - it learns the value of the optimal policy, no matter what exploration policy it actually follows
This could be bad if the exploration policy is dangerous
On-policy learning learns the value of the policy actually being followed.
e.g. act greedily 80% of the time and act randomly 20% of the time
Why? If the agent is actually going to explore, it may be better to optimise the policy it is going to follow
SARSA uses the experience (s,a,r,s′,a′) to update Q(s,a)
That is, SARSA uses the actual action that was performed to update the policy.
This leads to better use of the data collected - i.e. fewer costly mistakes, since the agent learns from the mistakes it actually makes
1.5.2 - SARSA Pseudocode
Initialise $\hat{Q}(s,a)$ arbitrarily (e.g. all zeros)
Observe the current state s
Select an action a (e.g. ε-greedy)
Repeat for each step of each episode, until convergence:
    Carry out action a
    Observe reward r and next state s'
    Select the next action a' (using the same policy)
    $\hat{Q}(s,a) \leftarrow \hat{Q}(s,a) + \alpha\big(r + \gamma \hat{Q}(s', a') - \hat{Q}(s,a)\big)$   # the update uses the action a' that will actually be performed
    s ← s'
    a ← a'   # the agent remembers the chosen action and carries it out on the next step
In this algorithm, use a Multi-Armed Bandit or ε-greedy policy to balance exploration and exploitation
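For comparison with the Q-learning sketch above, a SARSA sketch under the same assumed environment interface; note that the next action a' is chosen by the ε-greedy behaviour policy, used in the update, and then actually executed:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy SARSA; assumes env.reset() -> s and env.step(a) -> (s', r, done)."""
    Q = defaultdict(float)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(state, x)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)          # the action that will actually be taken
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next                    # remember the chosen action
    return Q
```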
1.6 - Q-Learning vs SARSA
The reward for falling into / off the cliff is -100, whereas the reward for every other move is -1
The goal of MDPs and reinforcement learning is to maximise the (cumulative) reward
The sum of rewards during learning is higher for SARSA: while Q-Learning is still learning, it follows the path along the cliff edge and occasionally falls in while exploring, leading to a smaller overall reward.
2.0 - Value Function Approximation (VFA) for RL
David Silver's Course on RL, Lecture 6: Value Function Approximation
We want to use MDPs and reinforcement learning to solve large-scale problems. For example:
Backgammon: $10^{20}$ states (an explicit model would need on the order of $10^{20} \times \text{num\_actions} \times 10^{20}$ entries)
Chess: $10^{30}$ to $10^{40}$ states
Computer Go: $10^{170}$ states
Quad-copter, bipedal robot: Enormous continuous state space
Tabular methods (that perform computation on every explicit state) cannot handle this.
If the state space is large, several problems arise
The table of Q-value estimates can get very large
Q-value updates can be slow to propagate.
High-reward states can be hard to find
State space grows exponentially with feature dimension.
2.3 - Q-Learning
Q-Learning produces a table of Q-Values (values for each state-action pair)
2.3.1 - Pacman Example
While the two images shown are very clearly different states, their associated value should be roughly equal - the game is about to end as the agent is surrounded by ghosts and will die soon.
We could represent these things as features:
Distance to closest ghost
Distance to closest dot (food)
Number of ghosts
$1 / (\text{distance to closest dot})^2$
Is Pac-Man in a tunnel? (0 or 1)
Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Approximate Q-learning update, after carrying out action a and observing reward r and next state s':
    Select action a' (using a policy based on $Q_{\mathbf{w}}$, the approximate Q-function defined by the feature weights $\mathbf{w}$)
    Let $\delta = r + \gamma\, Q_{\mathbf{w}}(s', a') - Q_{\mathbf{w}}(s, a)$
    For i = 0 to n:
        $w_i \leftarrow w_i + \eta\, \delta\, F_i(s, a)$
    s ← s'
    a ← a'
Intuition: this is performing gradient descent over the feature weights, adjusting them to reduce the difference between the sampled target value and the current estimate.
Keep performing these updates until the weights converge.
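A sketch of that update with a linear Q-function over features; `features(s, a)` is a hypothetical feature extractor (e.g. distance to the closest ghost, 1/(distance to dot)^2), not something defined in these notes:

```python
def linear_q(w, f):
    """Q_w(s, a) = sum_i w_i * F_i(s, a) for a feature vector f = [F_0, ..., F_n]."""
    return sum(wi * fi for wi, fi in zip(w, f))

def feature_q_update(w, features, s, a, r, s_next, a_next, eta=0.01, gamma=0.99):
    """One weight update as in the pseudocode above.

    `features(s, a)` is an assumed feature extractor returning [F_0(s,a), ..., F_n(s,a)].
    """
    f_sa = features(s, a)
    delta = r + gamma * linear_q(w, features(s_next, a_next)) - linear_q(w, f_sa)
    for i, fi in enumerate(f_sa):
        w[i] += eta * delta * fi          # w_i <- w_i + eta * delta * F_i(s, a)
    return w
```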
2.4.2 - Advantages and Disadvantages of VFA
Advantages
Dramatically reduces the size of the Q-table
States will share many features
Allows generalisation to unvisited states
Makes behaviour more robust - making similar decisions in similar states
Handles continuous state space
Disadvantages
Requires feature selection - this often must be done by hand
Restricts the accuracy of the learned values - the learned value is no longer specific to each individual state.
The true reward function may not be linear in the features.
2.5 - Function Approximation
Estimate the true value function $V^\pi(s)$ using a parameterised approximate value function:
$\hat{v}(s, w) \approx V^\pi(s)$ or $\hat{q}(s, a, w) \approx Q^\pi(s, a)$
$w \in \mathbb{R}^n$ is a parameter (weight) vector
Typically the number of features n is less than the number of states.
2.5.1 - Types of Function Approximators
Linear
Neural Network
Radial Basis Functions
Decision Trees
Fourier / wavelet bases
Want a differentiable function approximator
Require a training method that is suitable for non-stationary and non-iid data (independent and identically distributed data)
In RL, data is not independent - actions and their related rewards are based on previous states.
2.5.2 - Gradient Descent
Let J(w) be a differentiable function of the parameter vector w
Define the gradient of J(w) to be:
$\nabla_w J(w) = \left( \frac{\partial J(w)}{\partial w_1}, \ldots, \frac{\partial J(w)}{\partial w_n} \right)^\top$
To find the local minimum of J(w), adjust w in the direction of the negative gradient - not guaranteed to be the global minimum
$\Delta w = -\tfrac{1}{2} \alpha \nabla_w J(w)$
where α is the step size
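A tiny numeric illustration of this rule on a toy quadratic objective; the objective $J(w) = \sum_i (w_i - c_i)^2$ and the target values $c$ are made up purely for illustration:

```python
def gradient_descent_step(w, grad, alpha=0.1):
    """Move the parameters against the gradient: w <- w - alpha * grad."""
    return [wi - alpha * gi for wi, gi in zip(w, grad)]

# Toy objective J(w) = sum_i (w_i - c_i)^2, whose gradient components are 2*(w_i - c_i)
c = [1.0, -2.0]
w = [0.0, 0.0]
for _ in range(100):
    grad = [2 * (wi - ci) for wi, ci in zip(w, c)]
    w = gradient_descent_step(w, grad)
# w is now close to c = [1.0, -2.0], the (local and global) minimiser of J
```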
2.5.3 - Value Function Approximation by SGD
Goal: find $w$ that minimises the mean-squared error between the approximate value function $\hat{v}(S, w)$ and the true value function $v_\pi(S)$:
$J(w) = \mathbb{E}_\pi\left[\big(v_\pi(S) - \hat{v}(S, w)\big)^2\right]$
Incrementally update the parameter vector to find (local) optimum
Transfer the idea of incremental learning steps from the tabular case:
$\hat{v}(S) \leftarrow \hat{v}(S) + \alpha\big[v_\pi(S) - \hat{v}(S)\big]$
To function approximation using a gradient descent update:
$\Delta w = \alpha\big(v_\pi(S) - \hat{v}(S, w)\big)\nabla_w \hat{v}(S, w)$ (stochastic gradient descent samples this gradient rather than taking the full expectation)
e.g. in the Pacman example, we notice that the distance to the closest goal (dot) is inversely related to the value of the state.
Using the gradient, we can determine how strongly each feature contributes to a state's value.
Weights adjust until they converge such that the approximated value function is a 'good enough' estimate of each state's value.
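For the linear case built up in the next section, where $\hat{v}(S, w) = \mathbf{x}(S)^\top w$ for a feature vector $\mathbf{x}(S)$, the gradient of the approximator is just the feature vector, so the stochastic update takes a particularly simple form:

$\hat{v}(S, w) = \mathbf{x}(S)^\top w \;\Rightarrow\; \nabla_w \hat{v}(S, w) = \mathbf{x}(S) \;\Rightarrow\; \Delta w = \alpha\big(v_\pi(S) - \hat{v}(S, w)\big)\,\mathbf{x}(S)$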
2.6 - Representing States using Feature Vectors
Represent state with a feature vector
$\mathbf{x}(S) = \big( x_1(S), \ldots, x_n(S) \big)^\top$
Where S is a state.
e.g.
Distance from agent to landmarks / walls
Piece configurations in chess
In Pacman: #-of-ghosts-1-step-away
2.6.1 - Table Lookup Features
Table-lookup is just a special case of linear VFA
Using table lookup features:
$\mathbf{x}^{\text{table}}(S) = \big( \mathbf{1}(S = s_1), \ldots, \mathbf{1}(S = s_n) \big)^\top$
The parameter vector w gives the value of each individual state
$\hat{v}(S, w) = \big( \mathbf{1}(S = s_1), \ldots, \mathbf{1}(S = s_n) \big) \cdot \big( w_1, \ldots, w_n \big)^\top$
Where $\mathbf{1}(S = s_i)$ is an indicator function for whether that feature is 'turned on' for the current state.
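A small sketch showing that with these one-hot features the linear approximator simply reads off the weight of the current state, recovering the tabular value function (the state names and weights below are illustrative):

```python
states = ["s1", "s2", "s3"]            # made-up state names for illustration

def table_lookup_features(S):
    """x_table(S): one-hot indicator vector, with a 1 in the position of state S."""
    return [1.0 if S == si else 0.0 for si in states]

def v_hat(S, w):
    """v_hat(S, w) = x_table(S) . w"""
    return sum(x * wi for x, wi in zip(table_lookup_features(S), w))

w = [0.5, -1.0, 2.0]                   # one weight per state = the tabular value table
assert v_hat("s2", w) == -1.0          # reads the weight of s2, exactly like a table lookup
```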
2.7 - Incremental Prediction Algorithms
Up to this point, we have assumed that the true value function $v_\pi(s)$ is given by a supervisor (i.e. is known)
But in RL there is no supervisor, only rewards
In practice, we substitute a target for $v_\pi(s)$:
For Monte Carlo, the target is the return $G_t$:
$\Delta w = \alpha\big(G_t - \hat{v}(S_t, w)\big)\nabla_w \hat{v}(S_t, w)$
For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$ (see the code sketch after this list):
$\Delta w = \alpha\big(R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)\big)\nabla_w \hat{v}(S_t, w)$
For TD(λ), the target is the λ-return $G_t^\lambda$:
$\Delta w = \alpha\big(G_t^\lambda - \hat{v}(S_t, w)\big)\nabla_w \hat{v}(S_t, w)$
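As referenced above, a sketch of the TD(0) case with a linear approximator (a 'semi-gradient' update, since the target itself also depends on w); the feature map `x` is an assumed helper, not defined in these notes:

```python
def semi_gradient_td0(w, x, s, r, s_next, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) update with a linear approximator.

    Delta w = alpha * (R + gamma * v_hat(S', w) - v_hat(S, w)) * x(S),
    since for a linear v_hat the gradient with respect to w is just x(S).
    `x` is an assumed feature map returning a list of feature values.
    """
    v = lambda state: sum(f * wi for f, wi in zip(x(state), w))
    td_error = r + gamma * v(s_next) - v(s)
    for i, f in enumerate(x(s)):
        w[i] += alpha * td_error * f
    return w
```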
2.8 - Monte Carlo with Value Function Approximation
The return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$
Can therefore apply supervised learning to the training data:
$\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$
For example, linear Monte Carlo policy evaluation applies the update $\Delta w = \alpha\big(G_t - \hat{v}(S_t, w)\big)\,\mathbf{x}(S_t)$:
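A sketch of one pass of this update over a finished episode; the episode data format and the feature map `x` are assumptions for illustration:

```python
def linear_mc_evaluate(episode_states, episode_rewards, x, w, alpha=0.01, gamma=0.99):
    """Update the weights towards the observed returns of one finished episode.

    episode_states:  [S_0, S_1, ..., S_{T-1}]
    episode_rewards: [R_1, R_2, ..., R_T]
    x: assumed feature map, x(S) -> list of feature values
    """
    G = 0.0
    # walk the episode backwards so G accumulates the discounted return from each step
    for S, R in zip(reversed(episode_states), reversed(episode_rewards)):
        G = R + gamma * G
        f = x(S)
        v_hat = sum(fi * wi for fi, wi in zip(f, w))
        for i, fi in enumerate(f):
            w[i] += alpha * (G - v_hat) * fi     # Delta w = alpha * (G_t - v_hat(S_t, w)) * x(S_t)
    return w
```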