Gather data by interacting with the world, given a sequence of experiences:
state, action, reward, state, action, reward, ...
The agent has to choose its action as a function of its history
At any time it must decide whether to:
Explore to gain more knowledge - try performing a new action
Exploit knowledge it has already discovered - continue to perform an action that it knows works well
1.6 - Why is Reinforcement Learning Hard?
What actions are responsible for a reward may have occurred a long time before the reward was received.
The long-term effect of an action depends on what the agent will do in the future
The explore-exploit dilemma: at each time should the agent be greedy or inquisitive?
1.6.1 - Reinforcement Learning Approaches
Model-Based vs Model-Free - What is being learned?
Model-Based
Use data to learn the missing components of the MDP problem, i.e. Transition & Reward functions
Once we know the transition and reward functions, solve the MDP problem
Indirect learning, but generally most efficient use of data.
Model-Free
Use data to learn the value function & policy directly
Direct learning, generally not the most efficient use of data, but usually fast.
Passive vs Active - How is the data being generated?
Passive Fixed policy
❗ Learn the transition and reward function through passive observation
The agent observes the world by following the policy OR given a data set (e.g. from video), the agent observes to learn value function or the model (Transition and Reward functions)
Active Classical Reinforcement Learning Problem
❗ Learn the transition and reward function through performing action and observing the response
The agent selects what action to perform, and the action performed determines the data it receives, which then determines how fast the agent converges to the correct MDP model
The exploration vs exploitation dilemma arises in the active reinforcement learning approach.
Note that we can have a combination of the two.
1.7 - Reinforcement Learning - Main Approaches
Approach One Learn a model consisting of (1) the state transition function P(s′∣s,a) and (2) the reward function R(s,a,s′), then solve this as an MDP.
Approach Two Learn Q∗(s,a) and use this to guide the action chosen - use this function to determine how good the (s,a) pair is.
Approach Three Search through a space of policies (controllers)
In all of these cases, we face the problem of exploration vs exploitation
2.0 - Exploration vs Exploitation - Multi-Armed Bandits
2.1 - Exploration vs Exploitation
All the methods that follow will have some convergence condition like "assuming we visit each state enough", or "taking actions according to some policy"
A fundamental question - If we don't know the system dynamics, should we take actions that will give us more information, or exploit current knowledge to perform as best we can?
If we use a greedy policy, bad initial estimates in the first few cases can drive policy into sub-optimal region, and never explore further
💡 Instead of acting according to the greedy policy, act according to a sampling strategy that will explore state-action pairs until we get a "good" estimate of the function.
2.2 - Multi-Armed Bandit Problem
🧠 Assumption:
1. The choice of several arms / machines
2. Each arm pull is independent of other arm pulls
3. Each arm has a fixed, unknown average payoff
Which arm has the best average payoff? How do we maximise the sum of rewards over time?
We can determine the average payoff from each arm through sampling.
Consider a row of three poker machines.
R(win)=1 for all machines
P(A,win)=0.6,P(B,win)=0.55,P(C,win)=0.4
Expected utility theory tells us that A is the best arm, but we don't know that!
We want to explore all arms BUT if we explore too much, we may sacrifice the reward we could have gotten
We want to exploit promising arms more often, BUT if we exploit too much, we can get stuck with sub-optimal values because of a lack of exploration
We want to minimise regret → loss from playing non-optimal arm
Need to balance between exploration and exploitation
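To make the trade-off concrete, here is a minimal simulation sketch of the three-machine example above (the win probabilities and R(win)=1 come from the example; the function names are purely illustrative). Pure uniform sampling eventually identifies A as the best arm, but every pull spent on C contributes to the regret just described.

```python
import random

# Simulated poker machines from the example above: a pull pays R(win) = 1
# with the arm's (unknown to the agent) win probability, otherwise 0.
WIN_PROB = {"A": 0.6, "B": 0.55, "C": 0.4}

def pull(arm):
    """Simulate one pull of the given arm."""
    return 1 if random.random() < WIN_PROB[arm] else 0

def estimate_payoffs(pulls_per_arm=1000):
    """Estimate each arm's average payoff by uniform sampling (pure exploration)."""
    return {arm: sum(pull(arm) for _ in range(pulls_per_arm)) / pulls_per_arm
            for arm in WIN_PROB}

print(estimate_payoffs())  # e.g. {'A': 0.61, 'B': 0.546, 'C': 0.39}
```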
2.4 - Exploration Strategies
An exploration strategy is a rule for choosing which arm to play at some time step t given arm selections and outcomes of previous trials at times 0,1,...,t−1 (also called a policy in the MAB literature, but we'll reserve that word for MDP/RL state-action policies)
ϵ-greedy strategy choose random action with probability ϵ and choose a best action with probability 1−ϵ
Choose ϵ<0.5 (so greater probability of choosing best action)
Softmax/Boltzmann Strategy In state s, choose action a with probability e^{Q(s,a)/τ} / ∑_{a′} e^{Q(s,a′)/τ}, where τ>0 is the temperature coefficient
(The denominator normalises the expression, so the probabilities over actions sum to 1; a small selection sketch follows this list.)
EXP3 / Exponential-Weight Algorithm for Exploration and Exploitation is an algorithm used in adversarial cases.
Optimism in the face of uncertainty - Initialise Q to values that encourage exploration
Upper Confidence Bound Take into account average + variance information
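The softmax/Boltzmann rule is the only strategy above whose formula is not expanded later, so here is a minimal selection sketch of it, assuming the Q estimates for the current state are held in a plain dict (the function name is illustrative only).

```python
import math
import random

def boltzmann_action(q_values, tau=1.0):
    """Sample an action with probability proportional to e^(Q(s,a)/tau).

    q_values: dict mapping action -> current Q estimate for the state s.
    tau:      temperature; large tau -> near-uniform (more exploration),
              small tau -> near-greedy (more exploitation).
    """
    actions = list(q_values)
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    q_max = max(q_values.values())
    weights = [math.exp((q_values[a] - q_max) / tau) for a in actions]
    total = sum(weights)                      # the normalising denominator
    probs = [w / total for w in weights]      # these sum to 1
    return random.choices(actions, weights=probs, k=1)[0]

# Example: with a low temperature the higher-valued action dominates.
print(boltzmann_action({"left": 0.2, "right": 0.8}, tau=0.1))
```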
2.4.1 - Epsilon-Greedy Exploration
🧠 If we are just considering exploitation and exploration, we *do the thing that rewards us most, most of the time, and sample other paths with some small probability*
Assign a weight to each sampling strategy
Start with equal weight for each strategy
Strategy with the highest weight is selected with probability (1−ϵ).
The rest are selected with probability ϵ/N where N is the number of strategies available
A typical value might be ϵ=0.1
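A minimal sketch of ϵ-greedy at the level of individual actions (the same rule applies to choosing among weighted strategies as described above); the dict of value estimates and the function name are assumptions for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon choose a random action, otherwise a best one.

    q_values: dict mapping action -> current value estimate.
    """
    if random.random() < epsilon:
        return random.choice(list(q_values))               # explore
    best_value = max(q_values.values())
    best_actions = [a for a, v in q_values.items() if v == best_value]
    return random.choice(best_actions)                     # exploit, breaking ties randomly
```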
2.4.2 - Upper Confidence Bound
UCB1 algorithm (Auer et al, 2002)
Pull every arm k≥1 times, then
at each time step, choose the arm i that maximises the UCB1 upper confidence bound:
UCB1_i = v̂_i + c√(ln(N) / n_i)
v̂_i is the current value (mean) estimate for arm i
c is a tunable parameter
N is the total number of arm pulls
n_i is the number of times arm i has been pulled
v̂_i is the exploitation term
c√(ln(N) / n_i) is the exploration term
A higher estimated reward v̂_i is better (exploit)
Expect the "true value" to be in some confidence interval around v̂_i
The confidence interval is large when the number of trials n_i is small, and shrinks as n_i grows
High uncertainty about an arm → larger exploration term
Sample arm i more if its number of pulls n_i is much less than the total number of pulls N
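A minimal sketch of the UCB1 choice rule, assuming every arm has already been pulled at least once (as the algorithm above requires); c plays the role of the tunable constant (classic UCB1 fixes it to √2), and the names are illustrative.

```python
import math

def ucb1_choice(v_hat, counts, total_pulls, c=2.0):
    """Choose the arm i maximising  v_hat_i + c * sqrt(ln(N) / n_i).

    v_hat:       list of current mean-reward estimates v̂_i
    counts:      list of pull counts n_i (each assumed >= 1)
    total_pulls: N, the total number of pulls made so far
    c:           tunable exploration constant
    """
    scores = [v + c * math.sqrt(math.log(total_pulls) / n)
              for v, n in zip(v_hat, counts)]
    return scores.index(max(scores))

# Example: arm 1 has the lower estimate but far fewer pulls,
# so its exploration term dominates and it is chosen (index 1).
print(ucb1_choice(v_hat=[0.6, 0.5], counts=[100, 4], total_pulls=104))
```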
3.0 - Model-Based Reinforcement Learning
3.1 - Asynchronous Value Iteration for MDPs (Storing Q(s,a))
If we knew the model, we would have:
A reward function R(s,a,s')
Transition function P(s'|s,a) or T(s,a,s')
And we could use value iteration to compute the optimal policy:
🧠 Initialise a table of Q values, Q(s,a), arbitrarily
Repeat forever:
Select state s, action a
Q(s,a) ← ∑_{s′} P(s′∣s,a)(R(s,a,s′) + γ max_{a′} Q(s′,a′))
In this case, we store the Q values of everything (all state, action pairs)
The catch is, we don't know P(s′∣s,a) or R(s,a,s′)
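For concreteness, a minimal sketch of this known-model update; the representation of P and R (nested dicts and a reward function) is an assumption for illustration.

```python
import random

def asynchronous_q_value_iteration(states, actions, P, R, gamma=0.9, sweeps=100000):
    """Asynchronous value iteration storing Q(s, a), with a *known* model.

    states, actions: lists of states and actions.
    P[s][a] is a dict {s_next: probability}; R(s, a, s_next) is the reward
    function. Both are assumed given here, which is exactly what RL lacks.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        s = random.choice(states)                 # select a state and action
        a = random.choice(actions)
        Q[(s, a)] = sum(                          # Bellman backup for this pair
            p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
            for s2, p in P[s][a].items()
        )
    return Q
```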
3.1.1 - Unknown Transition and Reward Function
When we don't have P(s′∣s,a) or R(s,a,s′), there is a simple approach: just estimate the MDP from the observed data
Suppose the agent acts in the world (according to some policy) and observes experience:
s0,a0,r0,s1,a1,r1,...,sn,an,rn
Form the empirical estimate of the MDP via the counts:
P̂(s′∣s,a) = ∑_{i=0}^{m} I(s_i=s, a_i=a, s_{i+1}=s′) / ∑_{i=0}^{m} I(s_i=s, a_i=a) = (# times moved from s to s′ by performing a) / (# times action a performed in state s)
R̂(s) = ∑_{i=0}^{m} I(s_i=s) r_i / ∑_{i=0}^{m} I(s_i=s) = (total immediate reward received in state s) / (# times in state s) = average reward for state s
Where I(⋅) is an indicator function, =1 if the condition is true
Now solve the MDP ⟨S, A, P̂, R̂⟩, e.g. using value iteration
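A small counting sketch of these estimates, assuming the experience trace has been grouped into (s, a, r, s′) transitions; the resulting P̂ and R̂ could then be fed to value iteration as above (the names are illustrative).

```python
from collections import defaultdict

def estimate_mdp(experience):
    """Empirical MDP estimate from a trace of (s, a, r, s_next) transitions.

    Returns P_hat[(s, a)][s_next] = count(s, a, s_next) / count(s, a)
    and     R_hat[s]              = average reward received in state s.
    """
    sa_counts = defaultdict(int)          # times action a was taken in state s
    sas_counts = defaultdict(int)         # times (s, a) led to s_next
    reward_sums = defaultdict(float)      # total reward received in state s
    visit_counts = defaultdict(int)       # times state s was visited

    for s, a, r, s_next in experience:
        sa_counts[(s, a)] += 1
        sas_counts[(s, a, s_next)] += 1
        reward_sums[s] += r
        visit_counts[s] += 1

    P_hat = defaultdict(dict)
    for (s, a, s_next), n in sas_counts.items():
        P_hat[(s, a)][s_next] = n / sa_counts[(s, a)]
    R_hat = {s: reward_sums[s] / visit_counts[s] for s in visit_counts}
    return P_hat, R_hat
```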
3.3.2 - Model-Based Reinforcement Learning
Model-Based Reinforcement Learning will converge to correct MDP (and hence correct value function / policy) given enough samples of each state
How can we ensure that we get the "right" samples? (This is a challenging problem for all methods we present here)
Advantages (informally) Makes efficient use of data
Disadvantages Requires that we build the actual MDP models, which is not much help if the state space is too large.
Building MDP models → Requires constructing the reward function and transition function
4.0 - Q-Learning
4.1 - Temporal Differences
Suppose we have a sequence of values v1,v2,v3,...
and we want a running estimate of the average of the first k values
A_k = (v_1 + ⋅⋅⋅ + v_k) / k
Suppose we know A_{k−1} and a new value v_k arrives
A_k = (v_1 + ... + v_{k−1} + v_k) / k
k·A_k = v_1 + ... + v_{k−1} + v_k
Since v_1 + ... + v_{k−1} = A_{k−1} × (k−1),
A_k = ((k−1)/k)·A_{k−1} + (1/k)·v_k
Let α_k = 1/k
A_k = (1−α_k)A_{k−1} + α_k v_k = A_{k−1} + α_k(v_k − A_{k−1})   (1) - Temporal Differences formula
Note that in this formula, the previous estimate is A_{k−1} and the update term is α_k(v_k − A_{k−1}).
The new term v_k − A_{k−1} is also known as the temporal difference error / TD-error - how different the new value v_k is from the old prediction A_{k−1}
At each step, we update the previous estimate A_{k−1} by the constant α_k multiplied by the TD error v_k − A_{k−1}
That is, we update the estimate A_{k−1} with (a fraction of) the difference between the estimate and the observed value v_k
Often we use this update with α fixed
We can guarantee convergence to the average if
∑_{k=1}^{∞} α_k = ∞
∑_{k=1}^{∞} α_k² < ∞
e.g. if α_k = 1/k or α_k = a/(b+k)
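A tiny sketch of formula (1) with α_k = 1/k, which reproduces the ordinary running average:

```python
def running_average(values):
    """Incremental average via the temporal-difference form of the update:
    A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}), with alpha_k = 1/k."""
    estimate = 0.0
    for k, v in enumerate(values, start=1):
        alpha = 1.0 / k                  # satisfies the convergence conditions above
        estimate = estimate + alpha * (v - estimate)
    return estimate

print(running_average([2, 4, 6, 8]))     # 5.0
```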
4.2 - Q-Learning Implementation
With known reward and state-transition functions:
Q∗(s,a) = ∑_{s′} P(s′∣s,a)(R(s,a,s′) + γ max_{a′} Q∗(s′,a′))
Idea Store Q(state, action) in a table and update it as in asynchronous value iteration, but using experience (empirical probabilities and rewards).
Suppose the agent has an experience (s,a,r,s′)
This provides one piece of data to update Q(s,a)
An experience (s,a,r,s′) provides a new estimate for the value of Q∗(s,a)
TD Target = r + γ max_{a′} Q^(s′,a′)
Q^∗(s,a) = r + γ max_{a′} Q^(s′,a′) = reward + discounted future Q value
Note: We potentially know nothing about the environment, but we can initialise the Q-values to all zeros.
This can be used in the Temporal Differences formula:
Q^(s,a) ← Q^(s,a) + α(r + γ max_{a′} Q^(s′,a′) − Q^(s,a))
Therefore, based on empirical values, we can update our Q value.
4.2.1 - Q-Learning Pseudocode
Iteratively estimate the table Q^(s,a) from experience:
Initialise Q^(s,a) arbitrarily (e.g. all zeros)
Observe the current state s
Repeat for each episode, until convergence: {Get the policy}
Select and carry out an action a
Observe reward r and state s'
Q^(s,a) ← Q^(s,a) + α(r + γ max_{a′} Q^(s′,a′) − Q^(s,a))
s←s′
For each state s {Extract the policy - find the action that optimises the reward for each state}
π(s)=argmaxaQ^(s,a) # Greedy approach
return π,Q^
🧠 This is very similar to the Bellman equations - like Value Iteration (and somewhat like policy iteration)
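As a concrete companion to the pseudocode, here is a minimal tabular Q-learning sketch. The env.reset()/env.step() interface, the episode/step structure, and the ϵ-greedy choice are assumptions for illustration, not part of the notes.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning following the pseudocode above.

    `env` is a hypothetical environment with env.reset() -> s and
    env.step(a) -> (s_next, r, done); any gym-like interface would do.
    """
    Q = defaultdict(float)                          # Q^(s, a), initialised to 0

    def choose_action(s):                           # epsilon-greedy exploration
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = choose_action(s)                    # select and carry out an action
            s_next, r, done = env.step(a)           # observe reward and next state
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next

    seen_states = {s for (s, _) in Q}               # extract the greedy policy
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in seen_states}
    return policy, Q
```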
On squares with an arrow exiting the grid world, the only action available to the agent is to exit and receive the reward shown.
On any other square, its actions are Left or Right
If the agent is in a square with a square below it, its action will succeed with probability p and with probability 1−p it will fail and the agent will fall into a trap
In other squares, it always moves successfully.
In the Q-value update step, the agent updates its estimates from the observations gained during the current episode
In the action-selection step, we could use some multi-armed bandit techniques
4.2.3 - Q-Learning Example - Grid World
Using the Q-learning algorithm, the agent quickly determines which actions are the best to perform in each state.
4.3 - Properties of Q-Learning
Q-learning converges to an optimal policy, no matter what the agent does, as long as it tries each action in each state enough times.
But, what should the agent do? Use an exploration strategy to:
Exploit When in state s, select an action that maximises Q(s,a)
Explore Select another action (either arbitrarily or according to some probability)
Choose between these using the multi-armed bandit exploration strategies discussed earlier.
4.4 - Problems with Q-Learning
It does one backup between each experience - is this appropriate for a robot interacting with the real world?
An agent might be able to make better use of the data by doing multi-step backups or building a model and using MDP methods to determine the optimal policy
Perform multi-step backups as in TD
It learns separately for each state; we might be able to learn better over collections of states (feature-based RL)
5.0 - SARSA (State-Action-Reward-State-Action)
5.1 - On-Policy Learning
Q-Learning does off-policy learning - it learns the value of an optimal policy, no matter what it does
This could be bad if the exploration policy is dangerous
On-policy learning learns the value of the policy being followed.
e.g. act greedily 80% of the time and act randomly 20% of the time
Why? If the agent is actually going to explore, it may be better to optimise the policy it is going to do
SARSA uses the experience (s,a,r,s′,a′) to update Q(s,a)
That is, SARSA uses the actual action that was performed to update the policy.
This leads to better use of the data collected - i.e. fewer costly mistakes, since the agent learns from the mistakes it actually makes
5.2 - SARSA Pseudocode
Initialise Q^(s,a) arbitrarily (e.g. all zeros)
Observe the current state s
Select an action a
Repeat for each episode, until convergence:
Carry out action a
Observe reward r and state s'
Select the next action a' (e.g. using an ϵ-greedy strategy on Q^(s′,⋅))
Q^(s,a)←Q^(s,a)+α(r+γQ^(s′,a′)−Q^(s,a)) # here, the update uses the action a′ that was actually selected, not the max
s←s′
a←a′ # Here we're also looking at the action performed - the agent remembers the action that was performed.
When selecting actions, use a MAB or ϵ-greedy strategy to balance exploration and exploitation
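A minimal SARSA sketch under the same assumed env interface as the Q-learning sketch above; the only change is that the update bootstraps from the action a′ the agent actually selects.

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA, assuming a hypothetical env.reset()/env.step() interface.

    Unlike Q-learning, the update uses the next action a_next the agent will
    actually take under its behaviour policy (on-policy), not the max over actions.
    """
    Q = defaultdict(float)

    def choose_action(s):                           # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = choose_action(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = choose_action(s_next)          # the action that will actually be taken
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next                   # remember state AND action
    return Q
```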