1.0 - Recap - Utility and Rewards over Time

1.1 - Utility and Time

How would you compare the following sequences of rewards per week?
  1. $1,000,000, $0, $0, $0, ...
  2. $1,000, $1,000, $1,000, ...
  3. $1,000, $0, $0, ...
  4. $1, $1, $1, $1, ...
  5. $1, $2, $3, $4, $5, ....
It depends on how you frame the question: Is it an infinite sequence? Are we taking the average, the median, or the sum?
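One way to make such sequences comparable (made precise for MDPs in 3.6 below) is to discount a reward received $t$ weeks from now by $\gamma^t$ for some discount factor $0\le\gamma<1$. As an illustrative calculation:

$$V_1 = \$1{,}000{,}000, \qquad V_2 = \sum_{t=0}^{\infty}\$1{,}000\,\gamma^t = \frac{\$1{,}000}{1-\gamma}$$

so sequence 2 is worth more than sequence 1 precisely when $\gamma > 0.999$.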

1.2 - Rewards and Values

1.3 - Properties of Discounted Rewards

1.4 - Allais Paradox (1953)

  1. What would you prefer?

    1. (A) $1M, for certain

      $\text{Expected Utility}=\$1{,}000{,}000\times 1.0=\$1{,}000{,}000$

    2. (B) lottery $[0.10:\$25\text{M},\ 0.89:\$1\text{M},\ 0.01:\$0]$

      $\text{Expected Utility}=0.10\times\$25{,}000{,}000+0.89\times\$1{,}000{,}000+0.01\times\$0=\$3{,}390{,}000$

  2. What would you prefer?

    1. (C) lottery $[0.11:\$1\text{M},\ 0.89:\$0]$

      $\text{Expected Utility}=0.11\times\$1{,}000{,}000+0.89\times\$0=\$110{,}000$

    2. (D) lottery $[0.10:\$25\text{M},\ 0.90:\$0]$

      $\text{Expected Utility}=0.10\times\$25{,}000{,}000+0.90\times\$0=\$2{,}500{,}000$

It is inconsistent with the axioms of preference to have both $A \succ B$ and $D \succ C$ (the pattern most people report).

A, C: lottery $[0.11:\$1\text{M},\ 0.89:X]$

B, D: lottery $[0.10:\$25\text{M},\ 0.01:\$0,\ 0.89:X]$, where $X=\$1\text{M}$ for A and B, and $X=\$0$ for C and D.
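To see why, separate out the common $0.89:X$ term using an arbitrary utility function $u$ (a small derivation, not from the slides):

$$EU(\text{A or C}) = 0.11\,u(\$1\text{M}) + 0.89\,u(X), \qquad EU(\text{B or D}) = 0.10\,u(\$25\text{M}) + 0.01\,u(\$0) + 0.89\,u(X)$$

The difference between the two expected utilities does not depend on $X$, so any expected-utility maximiser that prefers A to B must also prefer C to D.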

1.5 - Framing Effects - Tversky and Kahneman

2.0 - Decision-Theoretic Planning

2.1 - Agents as Processes

2.2 - Decision-Theoretic Planning

2.3 - Initial Assumptions for Decision-Theoretic Planning

3.0 - Markov Decision Processes

3.1 - World State

3.2 - MDPs vs Markov Chains

3.2.1 - MDP Examples - To Exercise or Not || Simple Grid World

Example - To Exercise or Not

States = {fit, unfit}

Actions = {exercise, relax}

Dynamics:

    State    Action      p(fit | State, Action)
    fit      exercise    0.99
    fit      relax       0.7
    unfit    exercise    0.2
    unfit    relax       0.0

Reward (does not depend on resulting state)

    State    Action      Reward
    fit      exercise    8
    fit      relax       10
    unfit    exercise    0
    unfit    relax       5

In the long run, we can see that being fit is beneficial (it leads to greater reward).
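As a minimal sketch (the names are illustrative, not from the lecture), the two tables above can be encoded directly in Python:

```python
# "To Exercise or Not" MDP, encoded from the dynamics and reward tables above.
STATES = ["fit", "unfit"]
ACTIONS = ["exercise", "relax"]

# p_fit[(s, a)] = probability the resulting state is "fit" after doing a in s
p_fit = {
    ("fit", "exercise"): 0.99,
    ("fit", "relax"): 0.7,
    ("unfit", "exercise"): 0.2,
    ("unfit", "relax"): 0.0,
}

# reward[(s, a)] = immediate reward (does not depend on the resulting state)
reward = {
    ("fit", "exercise"): 8,
    ("fit", "relax"): 10,
    ("unfit", "exercise"): 0,
    ("unfit", "relax"): 5,
}

def transition(s, a):
    """Return a dict mapping each possible next state to its probability."""
    return {"fit": p_fit[(s, a)], "unfit": 1.0 - p_fit[(s, a)]}
```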

Example - Simple Grid World

States: 100 states corresponding to the position of the agent / robot

Actions: Up, Down, Left, Right

Transition: the robot moves in the commanded direction with probability 0.7, and in each of the other three directions with probability 0.1.

Rewards: if the robot crashes into an outside wall, it remains in its current position and receives a reward of -1. For the special reward states, the agent receives the reward when leaving the state.
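A hedged sketch of how this transition model could be simulated; the 10x10 coordinate layout, the function names, and the omission of the special reward states are assumptions made purely for illustration:

```python
import random

N = 10  # a 10 x 10 grid gives the 100 position states
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def step(state, action):
    """Sample (next_state, reward) for one move; special reward states omitted."""
    # The robot goes in the commanded direction with probability 0.7,
    # and in each of the other three directions with probability 0.1.
    directions = list(MOVES)
    weights = [0.7 if d == action else 0.1 for d in directions]
    actual = random.choices(directions, weights)[0]
    row, col = state
    new_row, new_col = row + MOVES[actual][0], col + MOVES[actual][1]
    if not (0 <= new_row < N and 0 <= new_col < N):
        return state, -1.0  # crashed into an outside wall: stay put, reward -1
    return (new_row, new_col), 0.0
```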

3.3 - Planning Horizons

3.4 - Information Availability

3.5 - Policy

3.5.1 - MDP Example - To Exercise or Not

(The dynamics and reward tables for the To Exercise or Not example and the description of the Simple Grid World are as given in 3.2.1 above.)

3.5.2 - Solutions to MDP Problems

3.6 - Discounted Rewards

3.7 - Value of a Policy


3.7.1 - Computing the Value of a Policy

4.0 - Value Iteration (Offline Method 1 of 2)

Given the following Grid World, Value Iteration can be performed to determine the optimal policy (the best action to take in each state).

Shown below are the value $V^*(s)$ of each state and the value $Q^*(s,a)$ of performing each action in each state.

Note here that in each state, $V^*(s)=\max_a Q^*(s,a)$.
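For reference, $Q^*(s,a)$ here is the expected discounted return of performing $a$ in $s$ and acting optimally afterwards, consistent with the update in 4.1:

$$Q^*(s,a)=\sum_{s'}P(s'|a,s)\,\{R(s,a,s')+\gamma V^*(s')\}, \qquad V^*(s)=\max_a Q^*(s,a)$$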

[Figure: Grid-World with two terminal states {+1, -1} and the set of possible actions]
[Figure: Value Iteration - V-values after 100 iterations]
[Figure: Value Iteration - Q-values after 100 iterations]

4.1 - Value Iteration Mechanics

  1. Set $V_0$ arbitrarily, e.g.:

    $\hat{V}(s)\leftarrow 0$
  2. Compute $V_{i+1}$ from $V_i$ - loop over all states $s$:

    $V_{i+1}(s)=\max_a \sum_{s'}P(s'|a,s)\,\{R(s,a,s')+\gamma V_i(s')\}$
  3. Once the values converge, recover the best policy from the current value function estimate:

    $\arg\max_a \sum_{s'}P(s'|a,s)\,\{R(s,a,s')+\gamma \hat{V}(s')\}$

i.e. pick the action that maximises the value of the $Q(s,a)$ function.
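A minimal sketch of this procedure in Python, assuming the MDP is supplied as a list `states`, a function `actions(s)` giving the actions available in `s`, a transition function `P(s2, s, a)` returning $P(s'|a,s)$, and a reward function `R(s, a, s2)` (these names and the interface are assumptions, not from the lecture):

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Synchronous value iteration; returns the value function and a greedy policy."""
    V = {s: 0.0 for s in states}  # step 1: initialise V0 arbitrarily (here, to 0)
    while True:
        # step 2: compute V_{i+1} from V_i with a full Bellman backup over all states
        V_new = {
            s: max(
                sum(P(s2, s, a) * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
                for a in actions(s)
            )
            for s in states
        }
        converged = max(abs(V_new[s] - V[s]) for s in states) < theta
        V = V_new
        if converged:
            break
    # step 3: recover the best policy by picking the action maximising Q(s, a)
    policy = {
        s: max(
            actions(s),
            key=lambda a, s=s: sum(P(s2, s, a) * (R(s, a, s2) + gamma * V[s2])
                                   for s2 in states),
        )
        for s in states
    }
    return V, policy
```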

4.2 - Value Iteration Example - Grid World

Step 1 - Initialisation


Step 10 - 10 Iterations


4.2.1 - Grid World Environment - Working

$V^\pi(s)=\sum_{s'}P(s'|s,\pi(s))\,[R(s,\pi(s),s')+\gamma V^\pi(s')]$
$V_{i+1}(s)=\max_a \sum_{s'}P(s'|a,s)\,\{R(s,a,s')+\gamma V_i(s')\}$
  1. Suppose we enumerate the cells as shown on the right.
  2. In the first sweep, the only cells whose values change are the ones from which the agent can reach a reward in one step - the two cells highlighted in yellow, (2,3) and (3,3).
  3. From this, together with the stochastic transition model (the probability of actually moving in each direction), we can create a state transition table describing the probability of ending up in a particular state $s'$ when taking an action from an initial state $s$.

Transition table for action RIGHT:

    Start (s) / End (s')    (2,4)    (3,4)    (1,4)    (3,3)
    (2,3)                   0.7      0.15     0.15     -
    (3,3)                   0.15     0.7      -        0.15
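As an illustration (keeping the rewards symbolic, since they depend on which neighbouring cells carry the special rewards), plugging the RIGHT row for cell (2,3) into the update above gives:

$$\sum_{s'}P(s'\mid\text{RIGHT},(2,3))\,\{R((2,3),\text{RIGHT},s')+\gamma V_i(s')\} = 0.7\,\{R(\cdot,(2,4))+\gamma V_i(2,4)\}+0.15\,\{R(\cdot,(3,4))+\gamma V_i(3,4)\}+0.15\,\{R(\cdot,(1,4))+\gamma V_i(1,4)\}$$

and $V_{i+1}(2,3)$ is the maximum of this quantity over the four actions.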

4.3 - Asynchronous Value Iteration

4.3.1 - Asynchronous Value Iteration

Storing V[s]

Storing Q[s, a]
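A minimal Python sketch of the variant that stores Q[s, a], reusing the MDP interface assumed in the value-iteration sketch above; state-action pairs are picked and backed up one at a time, in any order, rather than sweeping all states synchronously:

```python
import random

def async_value_iteration_q(states, actions, P, R, gamma=0.9, n_updates=100_000):
    """Asynchronous value iteration storing Q[s, a]; returns Q and a greedy policy."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(n_updates):
        # pick any state-action pair (here uniformly at random) and back it up in place
        s = random.choice(states)
        a = random.choice(actions(s))
        Q[(s, a)] = sum(
            P(s2, s, a) * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions(s2)))
            for s2 in states
        )
    policy = {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy
```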

5.0 - Policy Iteration (Offline Method 2 of 2)

5.1 - Policy Iteration

5.1.1 - Policy Iteration Example - Grid World Environment

  1. Set the policy to an arbitrary value for every state.

  2. Solve for $V^{\pi_i}$ for each state

  3. Choose the action that will optimise the value for each state and update the policy.

  4. Repeat steps 2-3 until the policy converges (stops changing), as sketched below.
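A minimal Python sketch of these four steps, reusing the MDP interface assumed in the value-iteration sketch above; step 2 is done here by iterative policy evaluation (solving the linear system for $V^{\pi_i}$ exactly would also work):

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eval_theta=1e-6):
    """Policy iteration: alternate policy evaluation and greedy policy improvement."""
    policy = {s: actions(s)[0] for s in states}  # 1. arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # 2. policy evaluation: iterate the fixed-point equation for V^pi
        while True:
            delta = 0.0
            for s in states:
                v = sum(P(s2, s, policy[s]) * (R(s, policy[s], s2) + gamma * V[s2])
                        for s2 in states)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_theta:
                break
        # 3. policy improvement: choose the action that optimises the value in each state
        stable = True
        for s in states:
            best = max(actions(s),
                       key=lambda a, s=s: sum(P(s2, s, a) * (R(s, a, s2) + gamma * V[s2])
                                              for s2 in states))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:  # 4. the policy has converged
            return policy, V
```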