1.0 - Overview of Module 4 - Learning to Act

1.1 - Expectations

At the end of the class you should be able to:

1.2 - Reinforcement Learning

1.3 - Assumptions on Environment for Reinforcement Learning (Module 4)

1.4 - Examples of Reinforcement Learning

1.5 - Experiences

1.6.1 - Reinforcement Learning Approaches

1.7 - Reinforcement Learning - Main Approaches

Approach One: Learn a model consisting of (1) the state transition function $P(s'|s,a)$ and (2) the reward function $R(s,a,s')$, then solve this as an MDP.

Approach Two: Learn $Q^*(s,a)$ and use this to guide the action chosen - use this function to determine how good the $(s,a)$ pair is.

Approach Three: Search through a space of policies (controllers).

In all of these cases, we face the problem of exploration vs exploitation

2.0 - Exploration vs Exploitation - Multi-Armed Bandits

2.1 - Exploration vs Exploitation

2.2 - Multi-Armed Bandit Problem

🧠 Assumptions:
  1. There is a choice of several arms / machines
  2. Each arm pull is independent of other arm pulls
  3. Each arm has a fixed, unknown average payoff

Which arm has the best average payoff? How do we maximise the sum of rewards over time?

  1. Consider a row of three poker machines.

    $R(\text{win})=1$ for all machines

    $P(A,\text{win})=0.6,\ P(B,\text{win})=0.55,\ P(C,\text{win})=0.4$

    Expected utility theory tells us that A is the best arm, but we don't know these probabilities in advance!
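
As a quick sanity check, here is a minimal Python sketch that simulates these three machines and estimates their payoffs by sampling. The `ARMS` dictionary and `pull` helper are illustrative names, and a reward of 0 for a loss is an assumption.

```python
import random

# Win probabilities from the example above; R(win) = 1, R(loss) = 0 is assumed.
ARMS = {"A": 0.6, "B": 0.55, "C": 0.4}

def pull(arm: str) -> float:
    """Pull one arm and return the (stochastic) reward."""
    return 1.0 if random.random() < ARMS[arm] else 0.0

# Estimate each arm's average payoff by sampling (the agent never sees ARMS directly).
estimates = {arm: sum(pull(arm) for _ in range(1000)) / 1000 for arm in ARMS}
print(estimates)  # e.g. {'A': 0.61, 'B': 0.54, 'C': 0.41} -- A looks best, as expected utility predicts
```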

2.4 - Exploration Strategies

2.4.1 - Epsilon-Greedy Exploration

🧠 To balance exploitation and exploration, we *take the action with the highest estimated reward most of the time, and sample other actions with some small probability ε*
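
A minimal sketch of this strategy, assuming the current value estimates are held in a dictionary keyed by arm/action; the `epsilon_greedy` name and the default ε = 0.1 are illustrative choices.

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1) -> str:
    """Pick the arm/action with the highest estimated value most of the time,
    but choose a uniformly random one with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

# Example: with epsilon = 0.1, 'A' is chosen ~90% of the time plus its share of the random 10%.
print(epsilon_greedy({"A": 0.6, "B": 0.55, "C": 0.4}))
```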

2.4.2 - Upper Confidence Bound

$UCB1_i=\hat{v}_i+c\sqrt{\frac{\ln(N)}{n_i}}$

$\hat{v}_i$ is the current value (mean) estimate for arm $i$

$c$ is a tunable exploration parameter

$N$ is the total number of arm pulls

$n_i$ is the number of times arm $i$ has been pulled

$\hat{v}_i$ is the exploitation term

$c\sqrt{\frac{\ln(N)}{n_i}}$ is the exploration term


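A sketch of arm selection under the UCB1 formula above; the list-based interface, the handling of never-pulled arms, and the default c = 2.0 are assumptions for illustration.

```python
import math

def ucb1(values, counts, total_pulls, c=2.0):
    """Return the index of the arm maximising v_i + c * sqrt(ln(N) / n_i).
    Arms that have never been pulled (n_i = 0) are tried first."""
    best, best_score = None, float("-inf")
    for i, (v, n) in enumerate(zip(values, counts)):
        score = float("inf") if n == 0 else v + c * math.sqrt(math.log(total_pulls) / n)
        if score > best_score:
            best, best_score = i, score
    return best

# Arm 1 has the lower mean estimate but has been pulled far less,
# so its exploration bonus makes it the next choice here.
print(ucb1(values=[0.6, 0.5], counts=[100, 5], total_pulls=105))  # -> 1
```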

3.0 - Model-Based Reinforcement Learning

3.1 - Asynchronous Value Iteration for MDPs (Storing Q(s,a))

🧠 Initialise a table of Q-values, $Q(s,a)$, arbitrarily. Repeat forever:
  1. Select a state $s$ and action $a$
  2. $Q(s,a)\leftarrow\sum_{s'} P(s'|s,a)\left(R(s,a,s')+\gamma \max_{a'}Q(s',a')\right)$
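
A minimal Python sketch of this loop, assuming the transition model `P` and reward function `R` are known and given as nested dictionaries; the data layout and the `async_value_iteration` name are illustrative, not part of the notes.

```python
import random

def async_value_iteration(states, actions, P, R, gamma=0.9, sweeps=10000):
    """Asynchronous value iteration storing Q(s, a).
    Assumes P[s][a] is a list of (s_next, prob) pairs and R[s][a][s_next] is a
    scalar reward -- i.e. the model is known (see the model-free methods later)."""
    Q = {s: {a: 0.0 for a in actions} for s in states}   # arbitrary initialisation
    for _ in range(sweeps):
        s = random.choice(states)                        # 1. select a state ...
        a = random.choice(actions)                       #    ... and an action
        # 2. one-step Bellman backup for that single (s, a) pair
        Q[s][a] = sum(p * (R[s][a][s2] + gamma * max(Q[s2].values()))
                      for s2, p in P[s][a])
    return Q
```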

3.1.1 - Unknown Transition and Reward Function

$\hat{R}(s)=\frac{\sum^m_{i=0}\mathbb{I}(s_i=s)\,r_i}{\sum^m_{i=0}\mathbb{I}(s_i=s)}=\frac{\text{sum of immediate rewards received in state }s}{\text{\# times in state }s}=\text{average reward for state }s$

where $\mathbb{I}(s_i=s)$ is the indicator function (1 if the $i$-th visited state is $s$, 0 otherwise).
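
When the transition and reward functions are unknown, they can be estimated from counts over experience tuples $(s,a,r,s')$. A rough sketch, with illustrative names, of maintaining those counts and computing $\hat{R}(s)$ together with an analogous $\hat{P}(s'|s,a)$:

```python
from collections import defaultdict

# Counters built up from experience tuples (s, a, r, s'); names are illustrative.
reward_sum = defaultdict(float)       # total immediate reward received in each state
state_visits = defaultdict(int)       # number of times each state was visited
transition_counts = defaultdict(int)  # counts of (s, a, s') transitions
sa_counts = defaultdict(int)          # counts of (s, a) pairs

def record(s, a, r, s_next):
    """Update the counts from one experience tuple."""
    reward_sum[s] += r
    state_visits[s] += 1
    transition_counts[(s, a, s_next)] += 1
    sa_counts[(s, a)] += 1

def R_hat(s):
    """Average immediate reward observed in state s (the formula above)."""
    return reward_sum[s] / state_visits[s] if state_visits[s] else 0.0

def P_hat(s_next, s, a):
    """Empirical estimate of P(s' | s, a)."""
    return transition_counts[(s, a, s_next)] / sa_counts[(s, a)] if sa_counts[(s, a)] else 0.0
```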

3.3.2 - Model-Based Reinforcement Learning

4.0 - Q-Learning

4.1 - Temporal Differences

$v_1+\dots+v_{k-1}=A_{k-1}\times(k-1)$
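
Here $A_{k-1}$ is the average of the first $k-1$ values. A standard way to continue from this identity is to rewrite the average of the first $k$ values as an incremental update (stated here for completeness):

$A_k=\frac{v_1+\dots+v_k}{k}=\frac{A_{k-1}(k-1)+v_k}{k}=A_{k-1}+\frac{1}{k}(v_k-A_{k-1})$

So with learning rate $\alpha=\frac{1}{k}$, the new estimate is the old estimate plus $\alpha$ times the difference between the latest sample and the old estimate - the temporal-difference form of the update.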

4.2 - Q-Learning Implementation

4.2.1 - Q-Learning Pseudocode

🧠 This update is very similar to the Bellman equation used in Value Iteration (and is somewhat like Policy Iteration)
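
Since the pseudocode itself is not reproduced here, the following is a minimal tabular Q-learning sketch. The gym-style `env` interface (`reset`, `step`, `actions`) and all hyperparameter defaults are assumptions, not part of the notes.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch. `env` is assumed to expose reset() -> state,
    step(action) -> (next_state, reward, done) and actions(state) -> list."""
    Q = defaultdict(float)                                  # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Action selection: epsilon-greedy over current Q estimates
            # (any multi-armed bandit strategy could be used here instead).
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Off-policy update: bootstrap from the best action in s2,
            # not necessarily the one the policy will actually take.
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```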

4.2.2 - Q-Learning Example


On squares with an arrow exiting the grid world, the only action available to the agent is to exit and receive the reward shown.

On any other square, its actions are Left or Right

If the agent is in a square with a square below it, its action will succeed with probability $p$; with probability $1-p$ it will fail and the agent will fall into a trap

In other squares, it always moves successfully.

In the update step, the agent updates its estimates from the observations gained during the current episode

In the action-selection step, we could use multi-armed bandit techniques (e.g. epsilon-greedy or UCB1)

4.2.3 - Q-Learning Example - Grid World

4.3 - Properties of Q-Learning

4.4 - Problems with Q-Learning

5.0 - SARSA (State-Action-Reward-State-Action)

5.1 - On-Policy Learning

5.2 - SARSA Pseudocode

Here, use a multi-armed bandit (MAB) strategy or an epsilon-greedy policy to balance exploration and exploitation
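
A minimal tabular SARSA sketch, using the same assumed gym-style `env` interface as the Q-learning sketch above; the update bootstraps from the action actually chosen next (on-policy), which is the key difference from Q-learning.

```python
import random
from collections import defaultdict

def choose(Q, env, s, epsilon=0.1):
    """Epsilon-greedy action selection (a UCB-style bandit rule could be used instead)."""
    if random.random() < epsilon:
        return random.choice(env.actions(s))
    return max(env.actions(s), key=lambda a: Q[(s, a)])

def sarsa(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA sketch, assuming env exposes reset(), step(a) and actions(s)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        a = choose(Q, env, s, epsilon)
        while not done:
            s2, r, done = env.step(a)
            a2 = None if done else choose(Q, env, s2, epsilon)
            # On-policy target: uses the action a2 the agent will actually take next.
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```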

5.2.1 - Q-Learning vs SARSA Pseudocode