1.0 - Review of MDPs

1.1 - Markov Decision Processes

1.2 - Policies

1.2.2 - Value of a Policy

1.2.3 - Value of the Optimal Policy

1.3 - Value Iteration

  1. Set $V_0$ arbitrarily, e.g.:

    $\hat{V}(s)\leftarrow 0$
  2. Compute $V_{i+1}$ from $V_i$ - loop for all states $s$:

    $V_{i+1}(s)=\max_a \sum_{s'}P(s'|a,s)\{R(s,a,s')+\gamma V_i(s')\}$
  3. Once the values converge, recover the best policy from the current value function estimate:

    $\arg\max_a \sum_{s'}P(s'|a,s)\{R(s,a,s')+\gamma \hat{V}(s')\}$

i.e. pick the action that maximises the value of the $Q(s,a)$ function
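
A minimal sketch of this loop, assuming the MDP is given as explicit NumPy arrays `P[s, a, s']` and `R[s, a, s']` (array names and shapes are illustrative, not from the notes):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P[s, a, s2]: transition probabilities; R[s, a, s2]: rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                      # step 1: V_0(s) <- 0
    while True:
        # step 2: Bellman backup for every state
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:     # step 3: values have converged
            break
        V = V_new
    policy = Q.argmax(axis=1)                   # pick argmax_a Q(s, a)
    return V_new, policy
```

The final `argmax` over `Q` is exactly the policy-recovery step 3 above.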

1.4 - Policy Iteration

1.5 - Value Iteration vs Policy Iteration

Value Iteration

Policy Iteration

Both are Dynamic Programming methods, i.e. methods that decompose the problem into smaller subproblems that can be solved recursively.

1.6 - Modified Policy Iteration

1.7 - Special Case: Finite Horizon MDPs

1.8 - Summary of Q-Values, V-Values, $\pi$ and R

$$
Q^*(s,a)=\sum_{s'}P(s'|a,s)\left(R(s,a,s')+\gamma V^*(s')\right)\\
V^*(s)=\max_a\ Q^*(s,a)\\
\pi^*(s)=\arg\max_a\ Q^*(s,a)
$$

Let

$R(s,a)=\sum_{s'}P(s'|a,s)R(s,a,s')$

Then:

$$
Q^*(s,a)=\sum_{s'}P(s'|a,s)R(s,a,s')+\gamma\sum_{s'}P(s'|a,s)V^*(s')\\
\Rightarrow Q^*(s,a)=R(s,a)+\gamma\sum_{s'}P(s'|a,s)V^*(s')
$$
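
As a quick check of this identity, both forms can be computed side by side, assuming the same hypothetical array layout as in the value iteration sketch above:

```python
import numpy as np

def q_direct(P, R, V, gamma):
    # Q*(s,a) = sum_s' P(s'|a,s) * (R(s,a,s') + gamma * V*(s'))
    return np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])

def q_factored(P, R, V, gamma):
    # R(s,a) = sum_s' P(s'|a,s) * R(s,a,s'), then
    # Q*(s,a) = R(s,a) + gamma * sum_s' P(s'|a,s) * V*(s')
    R_sa = np.einsum('ijk,ijk->ij', P, R)
    return R_sa + gamma * P @ V        # (S, A, S) @ (S,) -> (S, A)
```

`np.allclose(q_direct(P, R, V, 0.9), q_factored(P, R, V, 0.9))` should return `True` for any valid `P`, `R`, `V`.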

2.0 - Real Time Dynamic Programming (RTDP)

2.1 - Problem: Large State Space

2.1.1 - Policy Representation

2.2 - Computing the Policy Online

2.3 - Real Time Dynamic Programming (RTDP)

$\text{RTDP}(\text{Initial State } s_0, \text{Goal State } G):$
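
The notes only give the call signature; below is a minimal sketch of the standard RTDP trial loop, assuming hypothetical helpers `P(s, a)` (dict of successor probabilities), `R(s, a, s2)` (reward), `actions(s)` (applicable actions), and a set `goal`:

```python
import random

def rtdp(s0, goal, actions, P, R, V, gamma=0.9, n_trials=1000):
    """Sketch of the standard RTDP trial loop; V is a dict of value
    estimates, updated in place only for states actually visited."""
    def q(s, a):
        return sum(p * (R(s, a, s2) + gamma * V.get(s2, 0.0))
                   for s2, p in P(s, a).items())

    for _ in range(n_trials):
        s = s0
        while s not in goal:
            a = max(actions(s), key=lambda a: q(s, a))    # act greedily
            V[s] = q(s, a)                                # Bellman backup on s only
            succ = P(s, a)                                # sample the next state
            s = random.choices(list(succ), weights=list(succ.values()))[0]
    return V
```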

2.4 - Labelled RTDP

Labelled RTDP is an improvement on RTDP: it marks (labels) states whose values have converged as solved, skips updating them in later trials, and terminates once the initial state is labelled solved.

3.0 - Monte Carlo Tree Search (MCTS)

$V^*(s)=\max_a\sum_{s'}P(s'|a,s)[R(s,a,s')+\gamma V^*(s')]$

3.1 - Monte Carlo Methods

3.2 - Monte Carlo Tree Search (MCTS)

3.2.1 - Model-Based Monte Carlo

3.3 - MCTS Example of Steps

The transition function is treated as a black box in MCTS

3.3.1 - MCTS for MDP

3.4 - Commonly used MCTS for MDP

  1. Build the search tree based on the outcomes of the simulated plays.
  2. Iterate over the four main components (sketched in code after this list):
    1. Selection: Choose the best path
    2. Expansion: When a leaf node is reached, add a child node
    3. Simulation: Simulate from the newly added node $n$ to estimate its value
    4. Backpropagation: Update the value of the nodes visited in this iteration.
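
A compact sketch of this four-step loop for an MDP, assuming a black-box simulator `step(s, a) -> (s', r)` and illustrative helper names; selection here uses the common UCB1 rule, and for simplicity each action child stores a single sampled successor (full MDP-MCTS handles stochastic outcomes with separate chance nodes):

```python
import math, random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def mcts(root_state, actions, step, is_terminal, n_iters=1000, gamma=0.9, c=1.4):
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend the tree, picking children by UCB1
        while node.children and not is_terminal(node.state):
            node = max(node.children, key=lambda ch:
                       ch.value / (ch.visits + 1e-9) +
                       c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
        # 2. Expansion: add a child node per action (one sampled successor each)
        if not is_terminal(node.state):
            for a in actions(node.state):
                s2, _ = step(node.state, a)
                node.children.append(Node(s2, parent=node, action=a))
            node = random.choice(node.children)
        # 3. Simulation: random rollout from the new node to estimate its value
        s, ret, discount = node.state, 0.0, 1.0
        for _ in range(50):                       # depth-limited rollout
            if is_terminal(s):
                break
            s, r = step(s, random.choice(actions(s)))
            ret += discount * r
            discount *= gamma
        # 4. Backpropagation: update every node visited in this iteration
        while node is not None:
            node.visits += 1
            node.value += ret
            node = node.parent
    # recommend the most-visited action at the root
    return max(root.children, key=lambda ch: ch.visits).action
```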

3.4.1 - Node Selection

3.4.2 - Simulation

3.4.3 - Backpropagation

4.0 - Example of MCTS

5.0 - Value Function Approximation (VFA)

5.1 - Large Scale Problems

5.2 - MDPs and Reinforcement Learning with Features

Express the value function as a function of the features

5.3 - Linear Value Function Approximation

Learn a reward/value function as a linear combination of features
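
Concretely, using the same symbols as the algorithm in 5.4 (weights $w_0,\dots,w_n$ and features $F_i$), a linear approximation of the Q-function is:

$$
Q_{\bar{w}}(s,a)=\sum_{i=0}^{n}w_iF_i(s,a)
$$

where $F_0(s,a)=1$ is typically a constant bias feature.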

5.4 - Q-Learning with Linear Value Function Approximation

Given $\gamma=$ discount factor and $\eta=$ step size:

  1. Assign weights $\bar{w}=(w_0, ..., w_n)$ arbitrarily
  2. Observe the current state $s$
  3. **repeat** for each episode, until convergence:
    1. select and carry out action $a$
    2. observe reward $r$ and state $s'$
    3. select action $a'$ (using a policy based on $Q_{\bar{w}}$, which is a Q-function represented by the feature weights)
    4. let $\delta=r+\gamma Q_{\bar{w}}(s', a')-Q_{\bar{w}}(s,a)$
    5. for $i=0$ to $n$, update the weights: $w_i\leftarrow w_i+\eta\delta F_i(s,a)$
    6. $s\leftarrow s'$

Intuition: This is performing gradient descent across all features, effectively adjusting feature weights to reduce the difference between the sampled value and the estimated expected value.
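
A minimal Python sketch of the loop above, following the update exactly as written (the next action $a'$ is chosen by the current policy); `env`, `features`, and `actions` are hypothetical names, and `features(s, a)` is assumed to return a NumPy vector $(F_0(s,a),\dots,F_n(s,a))$:

```python
import random
import numpy as np

def q_w(s, a, w, features):
    """Q_w(s, a) = sum_i w_i * F_i(s, a) -- linear in the features."""
    return np.dot(w, features(s, a))

def epsilon_greedy(s, w, features, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_w(s, a, w, features))

def learn(env, features, n_features, actions,
          gamma=0.9, eta=0.01, epsilon=0.1, n_episodes=500):
    w = np.zeros(n_features)                      # assign weights arbitrarily
    for _ in range(n_episodes):
        s = env.reset()                           # observe the current state s
        a = epsilon_greedy(s, w, features, actions, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(a)             # carry out a; observe r and s'
            a2 = epsilon_greedy(s2, w, features, actions, epsilon)
            # delta = r + gamma * Q_w(s', a') - Q_w(s, a)
            target = r if done else r + gamma * q_w(s2, a2, w, features)
            delta = target - q_w(s, a, w, features)
            w += eta * delta * features(s, a)     # w_i <- w_i + eta * delta * F_i(s, a)
            s, a = s2, a2
    return w
```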

5.5 - Advantages and Disadvantages of VFAs

Advantages

Disadvantages

5.6 - General VFAs

Next Time: