It is inconsistent with the axioms of preference to have A ≻ B and D ≻ C:
A,C: lottery [0.11:$1M,0.89:X]
B,D: lottery [0.10:$25M,0.01:$0,0.89:X]
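A brief worked check (a sketch under the expected-utility axioms, writing $u$ for the utility function): the $0.89:X$ term is identical on both sides of each comparison, so it cancels, and the two preferences would require contradictory inequalities.
$A \succ B \iff 0.11\,u(\$1M) + 0.89\,u(X) > 0.10\,u(\$25M) + 0.01\,u(\$0) + 0.89\,u(X) \iff 0.11\,u(\$1M) > 0.10\,u(\$25M) + 0.01\,u(\$0)$
$D \succ C \iff 0.10\,u(\$25M) + 0.01\,u(\$0) > 0.11\,u(\$1M)$
These two inequalities cannot both hold, regardless of $X$.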
1.5 - Framing Effects - Tversky and Kahneman
A disease is expected to kill 600 people. Two alternative programs have been proposed:
Program A: 200 people will be saved
Program B: $1\over3$ → 600 people will be saved
$2\over3$ → no one will be saved
A disease is expected to kill 600 people. Two alternative programs have been proposed:
Program C: 400 people will die
Program D: $1\over3$ → no one will die
$2\over3$ → 600 will die
Programs A and C are identical, as are B and D - it is the same problem, just framed differently
2.0 - Decision-Theoretic Planning
2.1 - Agents as Processes
Agents carry out actions:
Infinite horizon: the agent acts forever
Indefinite horizon: the agent acts until some stopping criterion is met
Finite horizon: the agent acts for a finite and fixed number of steps
2.2 - Decision-Theoretic Planning
What should an agent do when:
It gets rewards (and penalties) and tries to maximise the rewards it receives
Actions can be stochastic (non-deterministic); the outcome of an action can't be fully predicted
2.3 - Initial Assumptions for Decision-Theoretic Planning
Flat instead of modular or hierarchical (everything in this course will be flat)
Explicit states instead of features, individuals or relations
Indefinite / infinite stages instead of static or finite stages
Fully observable instead of partially observable (we can exactly sense the state we are in)
Stochastic dynamics instead of deterministic dynamics (can't perform actions deterministically)
Complex preferences instead of goals
Single agent interaction instead of multiple agents
Knowledge is given instead of knowledge is learned
Perfect rationality instead of bounded rationality
3.0 - Markov Decision Processes
A framework to find the best sequence of actions to perform when the outcome of each action is non-deterministic
The basis for Reinforcement Learning
Used to solve games like Tic-Tac-Toe, Chess, and Go, and used in control systems such as traffic management and navigation systems.
3.1 - World State
The world state is the information such that, if the agent knows it, no information about the past is relevant to the future - the Markovian assumption
$S_k$ is the state at time $k$, and $A_k$ is the action at time $k$
$P(S_{t+1} \mid S_0, A_0, \dots, S_t, A_t) = P(S_{t+1} \mid S_t, A_t)$
The next state is only dependent on the current state - i.e. memoryless property.
P(s′∣s,a) is the probability that the agent will be in state s′ immediately after performing action a in state s
The state captures all relevant information from the history
The state is a sufficient statistic of the future
The dynamics are stationary if the transition probability distribution is the same at every time point.
3.2 - MDPs vs Markov Chains
A Markov Decision Process augments a Markov Chain with actions and values:
When you perform an action on a given state, you get a new state and some associated rewards.
A Markov Decision Process consists of:
Set of States (S)
Set of Actions (A)
Transition Function $P(S_{t+1} \mid S_t, A_t)$ for the stochastic / probabilistic environment
Reward Function $R(S_t, A_t, S_{t+1})$ specifies the reward at time $t$
Sometimes the reward is a random variable.
$R(s,a,s')$ is the expected reward received when the agent is in state $s$, does action $a$, and ends up in state $s'$
Sometimes we use $R(s,a) = \sum_{s'} P(s' \mid s,a)\, R(s,a,s')$
γ is a discount factor
An MDP's objective is to maximise the expected value of the rewards: $E\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right]$
Similar to a Utility function that we try to optimise
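As a small illustration of this objective (a Python sketch; the reward sequence below is made up), the discounted return of a single sampled trajectory can be computed as follows - the MDP objective is the expectation of this quantity over trajectories:

```python
def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t for one trajectory of rewards [r_0, r_1, ...]."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example with made-up rewards: 1 now, then 10 three steps later.
print(round(discounted_return([1, 0, 0, 10], gamma=0.9), 2))  # 8.29
```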
3.2.1 - MDP Examples - To Exercise or Not || Simple Grid World
Example - To Exercise or Not
States = {fit, unfit}
Actions = {exercise, relax}
Dynamics:
State, Action → p(fit | State, Action):
fit, exercise → 0.99
fit, relax → 0.7
unfit, exercise → 0.2
unfit, relax → 0.0
In the long run, we can see that being fit is going to be beneficial (lead to greater reward)
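A minimal sketch of how these dynamics could be represented in Python (the unfit probabilities are simply the complements of the p(fit) column above; the rewards are not specified here, so only the transition model is encoded):

```python
# Transition model for the "To Exercise or Not" MDP.
# P[(state, action)] maps each next state to its probability.
P = {
    ("fit", "exercise"):   {"fit": 0.99, "unfit": 0.01},
    ("fit", "relax"):      {"fit": 0.7,  "unfit": 0.3},
    ("unfit", "exercise"): {"fit": 0.2,  "unfit": 0.8},
    ("unfit", "relax"):    {"fit": 0.0,  "unfit": 1.0},
}

# Sanity check: each conditional distribution sums to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```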
Example - Simple Grid World
The agent can be in any grid cell
In certain cells, the agent is given positive (or negative) reward.
If you collide with the walls there is a reward of -1
States: 100 states corresponding to the position of the agent / robot
Actions: Up, Down, Left, Right
Transition: Robot goes in the commanded direction with probability 0.7, and in each of the other three directions with probability 0.1
Rewards: If it crashes into an outside wall, it remains in its current position and has a reward of -1. For special reward states - the agent gets the reward when leaving the state.
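A sketch of this transition model in Python (assuming the 10×10 grid is indexed by (row, col) pairs; the −1 wall penalty and the special reward cells belong in the reward function, which is omitted here):

```python
# Moves for each action: (row_delta, col_delta); rows increase downward.
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}
N = 10  # 10x10 grid -> 100 states

def transition(state, action):
    """Return {next_state: probability} for taking `action` in `state`.

    The commanded direction succeeds with probability 0.7; each of the
    other three directions occurs with probability 0.1. Moving into an
    outside wall leaves the agent where it is.
    """
    probs = {}
    for direction, (dr, dc) in MOVES.items():
        p = 0.7 if direction == action else 0.1
        r, c = state[0] + dr, state[1] + dc
        nxt = (r, c) if 0 <= r < N and 0 <= c < N else state  # wall: stay put
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# Example: from a corner, two of the four moves hit walls.
print(transition((0, 0), "Right"))  # {(0, 1): 0.7, (1, 0): 0.1, (0, 0): 0.2}
```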
3.3 - Planning Horizons
The planning horizon is how far ahead the planner looks to make a decision
The robot gets flung to one of the corners at random after leaving a positive (+10 or +3) reward state
The process never halts
Infinite planning horizon
The robot gets +10 or +3 in the state; after that it either stays there receiving no further reward, or is left with only a special exit action that ends the episode - these are absorbing states
The robot will eventually reach an absorbing state - non-zero probability of getting to the +10 or +3 reward state from other states
Indefinite horizon
3.4 - Information Availability
What information is available when the agent decides what to do?
Fully-observable MDP / FOMDP - the agent gets to observe $S_t$ when deciding on action $A_t$
Partially-observable MDP / POMDP - the agent has some noisy sensor of the state
It is a mix of a hidden Markov model and an MDP - the agent needs to remember (some function of) its sensing and acting history.
3.5 - Policy
A policy specifies the actions to take, moving from each state to the next, over the whole time horizon
A stationary policy is a function or a map
π:S→A
Given a state s, π(s) specifies what action the agent who is following π will do.
An optimal policy, usually denoted π∗, is one with the maximum expected discounted reward:
$\max_\pi E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t))\right]$
Where $\pi(s_t)$ is the action taken at time $t$
Here, we replace the action with the action determined by the policy.
We want to find optimal π(s) so that we can optimise our reward.
For a fully observable MDP with stationary dynamics and rewards, with an infinite or indefinite horizon, there is always an optimal stationary policy.
3.5.1 - MDP Example - To Exercise or Not
These are the same examples as in 3.2.1 (To Exercise or Not and the Simple Grid World); a policy for these MDPs assigns an action (exercise / relax, or a movement direction) to every state.
3.5.2 - Solutions to MDP Problems
For the simple grid world navigation:
A policy, or mapping from states to actions (π:S→A) could be given as shown in the figure to the right
A policy π(s) tells us what action the agent should perform for each state
The optimal policy π∗(s) tells us the best action the agent should perform in each state.
3.6 - Discounted Rewards
How much you value immediate rewards vs future rewards.
$V = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots + \gamma^{k-1} r_k + \dots$
$V = \sum_{k=1}^{\infty} \gamma^{k-1} r_k$
The discount factor γ∈[0,1) gives the present value of future rewards.
The value of receiving reward $r$ after $k+1$ time steps is $\gamma^k r$
This values immediate reward above delayed reward
γ close to 0 leads to myopic evaluation (short-sighted)
γ close to 1 leads to far sighted evaluation
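A standard consequence worth noting (assuming every reward is bounded by some $R_{max}$): the geometric series bounds the total discounted value, which is what keeps infinite-horizon values finite:
$V = \sum_{k=1}^{\infty} \gamma^{k-1} r_k \le \sum_{k=1}^{\infty} \gamma^{k-1} R_{max} = \frac{R_{max}}{1-\gamma}$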
3.7 - Value of a Policy
We first approach MDPs using a recursive reformulation of the objective called a value function
The value function of an MDP, Vπ(s) is the expected future reward of following an (arbitrary) policy π starting from state s, given by:
$V^\pi(s) = \sum_{s' \in S} P(s' \mid \pi(s), s)\left[R(s, \pi(s), s') + \gamma V^\pi(s')\right]$
Where the policy π(s) determines the action taken in state s
P(s′∣π(s),s) is our transition function for our stochastic world.
R(s,π(s),s′) is our reward function
γVπ(s′) is our discounted future value function
Here, we have dropped the time index as it is redundant, but note that $a_t = \pi(s_t)$
Note that this is a recursive definition - the value of Vπ(s) depends on the value of Vπ(s′)
Given a policy π
The Q-function represents the value of choosing an action and then following policy π in every subsequent state.
Qπ(s,a), where a is an action and s is a state, is the expected value of doing a in state s, then following policy π
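A minimal sketch of iterative policy evaluation and the corresponding Q-function in Python. The interfaces are illustrative assumptions, not from the notes: the transition model is a function P(s, a) returning a {next_state: probability} dict (like the transition() sketch earlier), and the reward is a function R(s, a, s2):

```python
def policy_evaluation(states, P, R, policy, gamma, iterations=1000):
    """Iteratively apply V(s) <- sum_s' P(s'|s,pi(s)) [R(s,pi(s),s') + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {
            s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                   for s2, p in P(s, policy[s]).items())
            for s in states
        }
    return V

def q_value(s, a, P, R, V, gamma):
    """Q^pi(s, a): do action a in state s, then follow the policy valued by V."""
    return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items())
```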
Given the following Grid World, Value Iteration can be performed to determine the optimal path / sequence of actions.
Shown below are the values of each state V∗(s) (as shown on the left) and the Q∗(s,a) value after performing an action.
Note here that in each state, $V^*(s) = \max_a Q^*(s,a)$
4.1 - Value Iteration Mechanics
Let $V_k$ be the k-step lookahead value function (expected reward, up to time step k)
Idea: Given an estimate of the k-step lookahead value function, determine the (k+1)-step lookahead value function
Set $V_0$ arbitrarily, e.g.:
$\hat{V}(s) \leftarrow 0$
Compute $V_{i+1}$ from $V_i$ - loop over all states $s$:
$V_{i+1}(s) = \max_a \sum_{s'} P(s' \mid a, s)\left\{ R(s,a,s') + \gamma V_i(s') \right\}$
Once the values converge, recover the best policy from the current value function estimate:
$\pi(s) = \arg\max_a \sum_{s'} P(s' \mid a, s)\left\{ R(s,a,s') + \gamma \hat{V}(s') \right\}$
i.e. pick the action that maximises the value of the Q(s,a) function (a short code sketch of this whole procedure is given at the end of this section)
Theorem: There is a unique function $V^*$ that satisfies these equations
If we know V∗, the optimal policy can be generated easily
Guaranteed to converge to V∗
No guarantee we'll reach the optimal solution in a finite amount of time, but in practice this converges exponentially fast (in k) to the optimal value function
The error reduces proportionally to $\gamma^k \over 1-\gamma$
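A compact sketch of this procedure in Python, using the same illustrative P(s, a) / R(s, a, s2) conventions as the earlier sketch; the convergence threshold theta is an assumption rather than something fixed in the notes:

```python
def value_iteration(states, actions, P, R, gamma, theta=1e-6):
    """Return (V, policy) by repeatedly applying the Bellman optimality update."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        for s in states:
            # V_{i+1}(s) = max_a sum_s' P(s'|a,s) [R(s,a,s') + gamma * V_i(s')]
            new_V[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2])
                    for s2, p in P(s, a).items())
                for a in actions
            )
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < theta:
            break
    # Recover a greedy policy from the converged value estimate.
    policy = {
        s: max(actions,
               key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                 for s2, p in P(s, a).items()))
        for s in states
    }
    return V, policy
```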
4.2 - Value Iteration Example - Grid World
Step 1 - Initialisation
After the first iteration of value iteration, we see that we have some initial results and observations
Our agent tries to move in the direction that optimises the value of the function V∗(s).
We see that for the square directly adjacent to the state [1.00], the value iteration model indicates that the best action is to move toward that state.
Additionally, for the squares directly adjacent to the [-1.00] square, the value iteration model indicates that the best action is to move away from that state.
The Q-Values also indicate that moving toward the [1.00] node is the best action when in an adjacent state and that moving away from the [-1.00] state is the best action when in an adjacent state.
However, after only one iteration, the agent does not yet have enough information to solve the environment efficiently.
Note here that the value iteration step was performed completely offline (before the agent started interacting with the environment).
Step 10 - 10 Iterations
When we increase the number of iterations to 10, we immediately observe that the values have started to converge to the final policy (we can observe that the arrows seem to be correctly indicating the optimal solution for each state).
The Q-Values of the model further indicate the weights for each action in each state - these values determine which action is most beneficial to perform in each state.
When the agent is run on this environment, provided with the following information, it solves the environment in the most efficient way possible - it is an optimal solution.
4.2.1 - Grid World Environment - Working
Suppose we have a grid world environment, with a layout as shown in the figure on the right.
Let γ=0.9
actions={up, down, left, right}
→ Move successfully with p=0.7
→ Otherwise, move perpendicular to intended direction (each direction p=0.15)
If the agent hits a wall, it stays exactly where it is (the state does not change)
Note that the value of the next time instant Vi+1 depends on the value of the previous time instant Vi
Note that this is the transition function, sometimes denoted T(s,a,s′)
Now, we update the cells in the grid world using the transition probabilities, looking one step ahead (the first iteration).
Suppose we enumerate the cells as shown on the right
Looking one step ahead at the cells from which the agent can reach a reward state in a single move, the only two cells that will be updated are those indicated in yellow: (2,3) and (3,3)
From this, and the transition probabilities for moving into each particular state, we can create a state transition table, which describes the probability of ending up in each state when moving from an initial state.
Start (s) \ End (s') | (2,4) | (3,4) | (1,4) | (3,3)
(2,3) | 0.7 | 0.15 | 0.15 | –
(3,3) | 0.15 | 0.7 | – | 0.15
(Transition Table for action RIGHT)
As shown in the formula for V(s) above, we multiply the transition probability (obtained from this transition table) by the reward plus the discounted future value ($\gamma \times V_i(s')$).
For State (3,3), we have:
For action Right, with the initial estimate $V_i(s') = 0$ for every state (so each $\gamma V_i(s')$ term contributes nothing):
{(3,3) → (2,4)}: 0.15 × (−1 + 0.9 × 0)
{(3,3) → (3,4)}: 0.7 × (1 + 0.9 × 0)
{(3,3) → (3,3)}: 0.15 × (0 + 0.9 × 0)
Sum for Right = (0.15 × −1) + (0.7 × 1) + (0.15 × 0) = 0.7 − 0.15 = 0.55
$V_{i+1}(3,3) = \max(\,0.55 \text{ [Right]},\ \{\text{repeat for the other actions}\}\ \dots)$
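A quick numeric check of this single backup (a Python sketch; the rewards −1, +1 and 0 come from entering (2,4), entering (3,4) and staying in (3,3) respectively, and all $V_i$ values are still 0 at this point):

```python
gamma = 0.9
V_i = 0.0  # initial value estimate for every state

# One backup for state (3,3), action Right: sum_s' P(s'|s,a) [R + gamma * V_i(s')]
q_right = (0.15 * (-1 + gamma * V_i)    # slip into the -1 state (2,4)
           + 0.70 * (1 + gamma * V_i)   # intended move into the +1 state (3,4)
           + 0.15 * (0 + gamma * V_i))  # hit the wall and stay in (3,3)
print(round(q_right, 2))  # 0.55
```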
4.3 - Asynchronous Value Iteration
Synchronous value iteration, as above, is quite computationally expensive: every iteration sweeps through all of the states
The agent doesn't need to sweep through all the states, but can update the value functions for each state individually
Value updates can be applied to the states in any order, even randomly
This converges to the optimal value functions, if each state is visited infinitely often in the limit
We update the policy in every step, as opposed to running the model until convergence then updating the policy
Leads to less computational complexity in some cases.
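A sketch of the asynchronous variant in Python (same illustrative P / R conventions as the earlier sketches; here the state to update is chosen uniformly at random at each step):

```python
import random

def asynchronous_value_iteration(states, actions, P, R, gamma, steps=100_000):
    """Update one randomly chosen state at a time, in place.

    `states` is a list; any update order works as long as every state
    keeps being visited in the limit.
    """
    V = {s: 0.0 for s in states}
    for _ in range(steps):
        s = random.choice(states)
        V[s] = max(
            sum(p * (R(s, a, s2) + gamma * V[s2])
                for s2, p in P(s, a).items())
            for a in actions
        )
    return V
```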
For Policy Iteration: solving for $V^{\pi_i}(s)$ (policy evaluation) means finding a solution to a set of $|S|$ linear equations with $|S|$ unknowns, as shown earlier.
Note The Week 7 lecture slides state that there are ∣S∣×∣A∣ linear equations with ∣S∣×∣A∣ unknowns. This is incorrect as clarified in the Week 8 Lecture thread.
5.1.1 - Policy Iteration Example - Grid World Environment
Observe that in this example, we take far fewer iterations than Value Iteration.
Set the policy to an arbitrary action for every state.
Solve for $V^{\pi_i}$ for each state (policy evaluation).
Choose the action that optimises the value for each state and update the policy (policy improvement); repeat the last two steps until the policy no longer changes (a code sketch follows below).
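A sketch of policy iteration in Python, with policy evaluation done by solving the $|S|$ linear equations directly via numpy; P(s, a) and R(s, a, s2) follow the same illustrative conventions as the earlier sketches:

```python
import numpy as np

def policy_iteration(states, actions, P, R, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    idx = {s: i for i, s in enumerate(states)}
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi
        # -- |S| linear equations in |S| unknowns.
        A = np.eye(len(states))
        b = np.zeros(len(states))
        for s in states:
            for s2, p in P(s, policy[s]).items():
                A[idx[s], idx[s2]] -= gamma * p
                b[idx[s]] += p * R(s, policy[s], s2)
        V = np.linalg.solve(A, b)
        # Policy improvement: act greedily with respect to V.
        new_policy = {
            s: max(actions,
                   key=lambda a: sum(p * (R(s, a, s2) + gamma * V[idx[s2]])
                                     for s2, p in P(s, a).items()))
            for s in states
        }
        if new_policy == policy:  # stable policy -> done
            return V, policy
        policy = new_policy
```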