Update the value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$
$V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$
$R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
TD(0) looks one step in advance, but TD(λ) looks multiple steps in advance.
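A minimal sketch of the one-step TD(0) update in Python, assuming a dictionary-based value table (the state names, reward and step sizes below are only for illustration):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the value table V (a dict keyed by state)."""
    td_target = r + gamma * V[s_next]   # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[s]         # delta_t
    V[s] += alpha * td_error            # V(S_t) <- V(S_t) + alpha * delta_t
    return td_error

# V is a defaultdict, so unvisited states start with value 0
V = defaultdict(float)
td0_update(V, s="A", r=1.0, s_next="B")
```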
1.3.1 - Sutton & Barto - Monte Carlo vs Temporal Difference Learning
The following figure is taken from p. 146 of Sutton & Barto, Reinforcement Learning: An Introduction.
Left: changes recommended by Monte Carlo methods (based on actual outcomes). Right: changes recommended by Temporal Difference methods (based on observations of successive states).
Note that for Monte Carlo methods to update, we have to wait until the episode is complete.
However, in TD methods we can update after every action is performed.
Monte Carlo methods update each state with respect to the actual outcome (the return actually observed)
TD methods update each state's estimate toward the observed reward plus the estimated value of the next state
That is, in these examples the dots are the estimates and the dotted lines represent the actual values.
1.3.2 - Multi-Step TD Learning (TD(λ))
We can actually use predictions that look more than one step ahead
"n-step TD"
TD(λ) is another approach, which combines (geometrically weights) the n-step returns for all n.
If we perform ∞-step TD, it is the same as Monte Carlo
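A sketch of the n-step TD target that this generalisation uses, assuming a dict-based value table V (the rewards and state below are placeholders):

```python
def n_step_td_target(rewards, V, s_after_n, gamma=0.99):
    """n-step TD target: R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}).

    `rewards` holds the n observed rewards R_{t+1}, ..., R_{t+n};
    `s_after_n` is the state S_{t+n} whose estimated value we bootstrap from.
    """
    n = len(rewards)
    discounted_rewards = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted_rewards + gamma**n * V.get(s_after_n, 0.0)

# With n = 1 this is the TD(0) target; as n grows and the episode ends
# (so the bootstrap term vanishes), it approaches the Monte Carlo return.
target = n_step_td_target([1.0, 0.0, 2.0], {"X": 0.5}, "X")
```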
1.4 - Q-Learning
With known reward and state-transition functions:
$Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\big(R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\big)$
Idea: store Q(state, action) in a table and update it as in asynchronous value iteration, but using experience (empirical probabilities and rewards).
Suppose the agent has an experience (s,a,r,s′)
This provides one piece of data to update Q(s,a)
An experience (s,a,r,s′) provides a new estimate for the value of Q∗(s,a)
$\text{TD target} = r + \gamma \max_{a'} Q(s', a')$
$\hat{Q}^*(s,a) = r + \gamma \max_{a'} Q(s', a') = \text{reward} + \text{discounted estimate of the best future value}$
Note: We potentially know nothing about the environment, but we can initialise the Q-values to all zeros.
This can be used in the Temporal Differences formula:
$\hat{Q}(s,a) \leftarrow \hat{Q}(s,a) + \alpha\big(r + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s,a)\big)$
Therefore, based on empirical values, we can update our Q value.
Iteratively estimate the table $\hat{Q}(s,a)$ from experience:
Initialise $\hat{Q}(s,a)$ arbitrarily (e.g. all zeros)
Observe the current state s
Repeat for each step of each episode, until convergence:   {learn the Q-values}
    Select and carry out an action a
    Observe reward r and next state s'
    $\hat{Q}(s,a) \leftarrow \hat{Q}(s,a) + \alpha\big(r + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s,a)\big)$
    s ← s'
For each state s:   {extract the policy - find the action that maximises $\hat{Q}(s,a)$}
    π(s) ← argmax_a $\hat{Q}(s,a)$
During learning, select actions using the exploration strategies from the Multi-Armed Bandits discussed earlier (e.g. ε-greedy).
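A minimal Python sketch of this loop, assuming a gym-style environment where `env.reset()` returns the initial state and `env.step(a)` returns `(s', r, done)`; this interface is an assumption, not part of the notes:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning; assumes env.reset() -> s and env.step(a) -> (s', r, done)."""
    Q = defaultdict(float)                                   # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    # extract the greedy policy for the states that were visited
    policy = {}
    for (state, _) in list(Q.keys()):
        policy[state] = max(actions, key=lambda x: Q[(state, x)])
    return Q, policy
```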
1.4.3 - Problems with Q-Learning
It does only one backup per experience - is this appropriate for a robot interacting with the real world?
An agent might be able to make better use of the data by doing multi-step backups or building a model and using MDP methods to determine the optimal policy
Perform multi-step backups as in TD
It learns separately for each state; it might be able to learn better over collections of states (feature-based RL)
1.5 - SARSA
🧠 Incorporate the exploration strategy with SARSA
1.5.1 - On-Policy Learning
Q-Learning does off-policy learning - it learns the value of the optimal policy, no matter what exploration policy it actually follows
This could be bad if the exploration policy is dangerous
On-policy learning learns the value of the policy actually being followed.
e.g. act greedily 80% of the time and act randomly 20% of the time
Why? If the agent is actually going to explore, it may be better to optimise the policy it is going to follow
SARSA uses the experience (s,a,r,s′,a′) to update Q(s,a)
That is, SARSA uses the actual action that was performed to update the policy.
This leads to better use of the data collected - i.e. fewer costly mistakes, since the agent learns from the mistakes it actually makes
1.5.2 - SARSA Pseudocode
Initialise $\hat{Q}(s,a)$ arbitrarily (e.g. all zeros)
Observe the current state s
Select an action a (e.g. ε-greedy)
Repeat for each step of each episode, until convergence:
    Carry out action a
    Observe reward r and next state s'
    Select the next action a' (using the same policy)
    $\hat{Q}(s,a) \leftarrow \hat{Q}(s,a) + \alpha\big(r + \gamma \hat{Q}(s', a') - \hat{Q}(s,a)\big)$   # the update uses the action a' that will actually be performed
    s ← s'
    a ← a'   # the agent remembers the chosen action and carries it out on the next step
In this algorithm, use a Multi-Armed Bandit or ε-greedy policy to balance exploration and exploitation
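For comparison with the Q-learning sketch above, a SARSA sketch under the same assumed environment interface; note that the next action a' is chosen by the ε-greedy behaviour policy, used in the update, and then actually executed:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy SARSA; assumes env.reset() -> s and env.step(a) -> (s', r, done)."""
    Q = defaultdict(float)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(state, x)])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)          # the action that will actually be taken
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next                    # remember the chosen action
    return Q
```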
1.6 - Q-Learning vs SARSA
The reward for falling into / off the cliff is -100, whereas the reward for every other move is -1
The goal of MDPs and reinforcement learning is to maximise the (cumulative) reward
The sum of rewards during learning is higher for SARSA: while Q-Learning is still learning, it follows the path along the cliff edge and occasionally falls in while exploring, leading to a smaller overall reward.
2.0 - Value Function Approximation (VFA) for RL
David Silver's Course on RL, Lecture 6: Value Function Approximation
We want to use MDPs and reinforcement learning to solve large-scale problems. For example:
Backgammon: $10^{20}$ states (an explicit model would need on the order of $10^{20} \times \text{num\_actions} \times 10^{20}$ entries)
Chess: $10^{30}$ to $10^{40}$ states
Computer Go: $10^{170}$ states
Quad-copter, bipedal robot: Enormous continuous state space
Tabular methods (that perform computation on every explicit state) cannot handle this.
If the state space is large, several problems arise
The table of Q-value estimates can get very large
Q-value updates can be slow to propagate.
High-reward states can be hard to find
State space grows exponentially with feature dimension.
2.3 - Q-Learning
Q-Learning produces a table of Q-Values (values for each state-action pair)
2.3.1 - Pacman Example
While the two images shown are very clearly different states, their associated value should be roughly equal - the game is about to end as the agent is surrounded by ghosts and will die soon.
We could represent these things as features:
Distance to closest ghost
Distance to closest dot (food)
Number of ghosts
$1 / (\text{distance to closest dot})^2$
Is Pac-Man in a tunnel? (0 or 1)
Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Approximate Q-learning update, after carrying out action a and observing reward r and next state s':
    Select action a' (using a policy based on $Q_{\mathbf{w}}$, the approximate Q-function defined by the feature weights $\mathbf{w}$)
    Let $\delta = r + \gamma\, Q_{\mathbf{w}}(s', a') - Q_{\mathbf{w}}(s, a)$
    For i = 0 to n:
        $w_i \leftarrow w_i + \eta\, \delta\, F_i(s, a)$
    s ← s'
    a ← a'
Intuition: this is performing gradient descent over the feature weights, adjusting them to reduce the difference between the sampled target value and the current estimate.
Keep performing these updates until the weights converge.
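A sketch of that update with a linear Q-function over features; `features(s, a)` is a hypothetical feature extractor (e.g. distance to the closest ghost, 1/(distance to dot)^2), not something defined in these notes:

```python
def linear_q(w, f):
    """Q_w(s, a) = sum_i w_i * F_i(s, a) for a feature vector f = [F_0, ..., F_n]."""
    return sum(wi * fi for wi, fi in zip(w, f))

def feature_q_update(w, features, s, a, r, s_next, a_next, eta=0.01, gamma=0.99):
    """One weight update as in the pseudocode above.

    `features(s, a)` is an assumed feature extractor returning [F_0(s,a), ..., F_n(s,a)].
    """
    f_sa = features(s, a)
    delta = r + gamma * linear_q(w, features(s_next, a_next)) - linear_q(w, f_sa)
    for i, fi in enumerate(f_sa):
        w[i] += eta * delta * fi          # w_i <- w_i + eta * delta * F_i(s, a)
    return w
```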
2.4.2 - Advantages and Disadvantages of VFA
Advantages
Dramatically reduces the size of the Q-table
States will share many features
Allows generalisation to unvisited states
Makes behaviour more robust - making similar decisions in similar states
Handles continuous state space
Disadvantages
Requires feature selection - this often must be done by hand
Restricts the accuracy of the learned values - the learned value is no longer specific to each individual state.
The true reward function may not be linear in the features.
2.5 - Function Approximation
Estimate the true value function $V^\pi(s)$ using a parameterised approximate value function:
$\hat{v}(s, w) \approx V^\pi(s)$ or $\hat{q}(s, a, w) \approx Q^\pi(s, a)$
$w \in \mathbb{R}^n$ is a parameter (weight) vector
Typically the number of features n is less than the number of states.
2.5.1 - Types of Function Approximators
Linear
Neural Network
Radial Basis Functions
Decision Trees
Fourier / wavelet bases
Want a differentiable function approximator
Require a training method that is suitable for non-stationary and non-iid data (independent and identically distributed data)
In RL, data is not independent - actions and their related rewards are based on previous states.
2.5.2 - Gradient Descent
Let J(w) be a differentiable function of the parameter vector w
Define the gradient of J(w) to be:
$\nabla_w J(w) = \left( \frac{\partial J(w)}{\partial w_1}, \ldots, \frac{\partial J(w)}{\partial w_n} \right)^\top$
To find the local minimum of J(w), adjust w in the direction of the negative gradient - not guaranteed to be the global minimum
$\Delta w = -\tfrac{1}{2} \alpha \nabla_w J(w)$
where α is the step size
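A tiny numeric illustration of this rule on a toy quadratic objective; the objective $J(w) = \sum_i (w_i - c_i)^2$ and the target values $c$ are made up purely for illustration:

```python
def gradient_descent_step(w, grad, alpha=0.1):
    """Move the parameters against the gradient: w <- w - alpha * grad."""
    return [wi - alpha * gi for wi, gi in zip(w, grad)]

# Toy objective J(w) = sum_i (w_i - c_i)^2, whose gradient components are 2*(w_i - c_i)
c = [1.0, -2.0]
w = [0.0, 0.0]
for _ in range(100):
    grad = [2 * (wi - ci) for wi, ci in zip(w, c)]
    w = gradient_descent_step(w, grad)
# w is now close to c = [1.0, -2.0], the (local and global) minimiser of J
```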
2.5.3 - Value Function Approximation by SGD
Goal: find $w$ that minimises the mean-squared error between the approximate value function $\hat{v}(S, w)$ and the true value function $v_\pi(S)$:
$J(w) = \mathbb{E}_\pi\left[\big(v_\pi(S) - \hat{v}(S, w)\big)^2\right]$
Incrementally update the parameter vector to find (local) optimum
Transfer the idea of incremental learning steps from the tabular case:
$\hat{v}(S) \leftarrow \hat{v}(S) + \alpha\big[v_\pi(S) - \hat{v}(S)\big]$
To function approximation using a gradient descent update:
$\Delta w = \alpha\big(v_\pi(S) - \hat{v}(S, w)\big)\nabla_w \hat{v}(S, w)$ (stochastic gradient descent samples this gradient rather than taking the full expectation)
e.g. in the Pacman example, we notice that the distance to the closest goal (dot) is inversely related to the value of the state.
Using the gradient, we can determine how strongly each feature contributes to a state's value.
Weights adjust until they converge such that the approximated value function is a 'good enough' estimate of each state's value.
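For the linear case built up in the next section, where $\hat{v}(S, w) = \mathbf{x}(S)^\top w$ for a feature vector $\mathbf{x}(S)$, the gradient of the approximator is just the feature vector, so the stochastic update takes a particularly simple form:

$\hat{v}(S, w) = \mathbf{x}(S)^\top w \;\Rightarrow\; \nabla_w \hat{v}(S, w) = \mathbf{x}(S) \;\Rightarrow\; \Delta w = \alpha\big(v_\pi(S) - \hat{v}(S, w)\big)\,\mathbf{x}(S)$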
2.6 - Representing States using Feature Vectors
Represent state with a feature vector
$\mathbf{x}(S) = \big( x_1(S), \ldots, x_n(S) \big)^\top$
Where S is a state.
e.g.
Distance from agent to landmarks / walls
Piece configurations in chess
In Pacman: #-of-ghosts-1-step-away
2.6.1 - Table Lookup Features
Table-lookup is just a special case of linear VFA
Using table lookup features:
$\mathbf{x}^{\text{table}}(S) = \big( \mathbf{1}(S = s_1), \ldots, \mathbf{1}(S = s_n) \big)^\top$
The parameter vector w gives the value of each individual state
$\hat{v}(S, w) = \big( \mathbf{1}(S = s_1), \ldots, \mathbf{1}(S = s_n) \big) \cdot \big( w_1, \ldots, w_n \big)^\top$
Where $\mathbf{1}(S = s_i)$ is an indicator function for whether that feature is 'turned on' for the current state.
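A small sketch showing that with these one-hot features the linear approximator simply reads off the weight of the current state, recovering the tabular value function (the state names and weights below are illustrative):

```python
states = ["s1", "s2", "s3"]            # made-up state names for illustration

def table_lookup_features(S):
    """x_table(S): one-hot indicator vector, with a 1 in the position of state S."""
    return [1.0 if S == si else 0.0 for si in states]

def v_hat(S, w):
    """v_hat(S, w) = x_table(S) . w"""
    return sum(x * wi for x, wi in zip(table_lookup_features(S), w))

w = [0.5, -1.0, 2.0]                   # one weight per state = the tabular value table
assert v_hat("s2", w) == -1.0          # reads the weight of s2, exactly like a table lookup
```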
2.7 - Incremental Prediction Algorithms
Up to this point, we have assumed that the true value function $v_\pi(s)$ is given by a supervisor (i.e. is known)
But in RL there is no supervisor, only rewards
In practice, we substitute a target for $v_\pi(s)$:
For Monte Carlo, the target is the return $G_t$:
$\Delta w = \alpha\big(G_t - \hat{v}(S_t, w)\big)\nabla_w \hat{v}(S_t, w)$
For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$ (see the code sketch after this list):
$\Delta w = \alpha\big(R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)\big)\nabla_w \hat{v}(S_t, w)$
For TD(λ), the target is the λ-return $G_t^\lambda$:
$\Delta w = \alpha\big(G_t^\lambda - \hat{v}(S_t, w)\big)\nabla_w \hat{v}(S_t, w)$
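As referenced above, a sketch of the TD(0) case with a linear approximator (a 'semi-gradient' update, since the target itself also depends on w); the feature map `x` is an assumed helper, not defined in these notes:

```python
def semi_gradient_td0(w, x, s, r, s_next, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) update with a linear approximator.

    Delta w = alpha * (R + gamma * v_hat(S', w) - v_hat(S, w)) * x(S),
    since for a linear v_hat the gradient with respect to w is just x(S).
    `x` is an assumed feature map returning a list of feature values.
    """
    v = lambda state: sum(f * wi for f, wi in zip(x(state), w))
    td_error = r + gamma * v(s_next) - v(s)
    for i, f in enumerate(x(s)):
        w[i] += alpha * td_error * f
    return w
```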
2.8 - Monte Carlo with Value Function Approximation
The return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$
Can therefore apply supervised learning to the training data:
$\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$
For example, linear Monte Carlo policy evaluation applies the update $\Delta w = \alpha\big(G_t - \hat{v}(S_t, w)\big)\,\mathbf{x}(S_t)$:
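A sketch of one pass of this update over a finished episode; the episode data format and the feature map `x` are assumptions for illustration:

```python
def linear_mc_evaluate(episode_states, episode_rewards, x, w, alpha=0.01, gamma=0.99):
    """Update the weights towards the observed returns of one finished episode.

    episode_states:  [S_0, S_1, ..., S_{T-1}]
    episode_rewards: [R_1, R_2, ..., R_T]
    x: assumed feature map, x(S) -> list of feature values
    """
    G = 0.0
    # walk the episode backwards so G accumulates the discounted return from each step
    for S, R in zip(reversed(episode_states), reversed(episode_rewards)):
        G = R + gamma * G
        f = x(S)
        v_hat = sum(fi * wi for fi, wi in zip(f, w))
        for i, fi in enumerate(f):
            w[i] += alpha * (G - v_hat) * fi     # Delta w = alpha * (G_t - v_hat(S_t, w)) * x(S_t)
    return w
```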