To reason about other agents, we need some model of what the other agent(s) are thinking and how they are likely to act
0.1 - Strategic Uncertainty
There are two key settings that give rise to strategic uncertainty:
Simultaneous move / imperfect information - unaware of what your opponent is going to do - e.g. scissors paper rock
Perfect Information - You can see the entire environment and what moves your opponent has made up to that point, so you can reason about what they're going to do next.
1.0 - Single Agents vs Multiple Agents
Many domains are characterised by multiple agents rather than a single agent - these are multi-agent systems.
A multi-agent system is a computational system consisting of two or more interacting intelligent agents
Multi-agent systems are used to model and solve complex problems that are difficult or impossible for an individual agent or monolithic system to solve
Agents can be cooperative, competitive or somewhere in-between
Agents that are strategic can't be modelled as nature.
Trading on stock market → Build probabilistic model to model the stock market.
Bidding at an auction → Model as adversarial agents - "You can see what they're doing"
1.2 - Game Theory
Game theory is an umbrella term for the study of mathematical models of strategic interaction among rational agents
The name "game" theory is traced to its origin analysing zero-sum games such as chess and card games, in which each participant's gains or losses are exactly balanced to those of other participants
It is now used to model all manner of scenarios, from cooperation and coordination, to conflict and deceit.
🧠 Key Idea: Game Theory provides a model for reasoning over other agents' actions and behaviours in AI, and is considered a foundation of multi-agent system design and analysis.
1.2.1 - Game Theory: Normal v Extensive Form
A key distinction in game theory regards the timing of when information about one agent's actions becomes available to other agents in the environment
Extensive Form: Agents / players take turns to act
The game is modelled by a game graph where game states are nodes, and each action is represented by an edge.
Perfect Information: The agent is able to see the other agents' moves, akin to fully-observable systems where you can see every action that a player has taken up to that point (e.g. Tic-Tac-Toe, Chess, Go)
Imperfect Information: Like partially observable systems, the agent cannot fully see the other agents' moves (e.g. card games, Battleships)
Normal Form: Simultaneous moves; players cannot observe each other's action choice:
Typically represented by a payoff matrix
E.g. rock-paper-scissors
1.2.2 - Sharing Game with Perfect Information
🧠 Extensive form of a sharing game with perfect information
Players take turns to move
Actions are represented by edges, payoffs are indicated by tuples at the leaf nodes (Andy's Reward, Barb's Reward)
Andy observes the initial state of the game and chooses to either keep, share or give the reward.
Barb chooses whether the action will be performed.
1.2.3 - Sharing Game with Imperfect Information
We introduce an ellipse around nodes, which we call an information set.
The second player cannot distinguish between the nodes in an information set.
Note that this is a zero-sum game, where the rewards in each outcome sum to 0.
We could model the game more concisely using a payoff matrix as shown below
By convention, the tuple of rewards is given in the order (row_player, col_player)
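Since the matrix itself comes from the lecture figure, here is a rough sketch of how such a payoff matrix can be written down in code, using rock-paper-scissors (a zero-sum game mentioned earlier) as a stand-in example; the dictionary layout is an illustrative choice, not a prescribed representation.

```python
# Illustrative payoff matrix for rock-paper-scissors (not the lecture's figure).
# Rewards are tuples in the order (row_player, col_player).
ACTIONS = ["rock", "paper", "scissors"]

payoff = {
    "rock":     {"rock": (0, 0),  "paper": (-1, 1), "scissors": (1, -1)},
    "paper":    {"rock": (1, -1), "paper": (0, 0),  "scissors": (-1, 1)},
    "scissors": {"rock": (-1, 1), "paper": (1, -1), "scissors": (0, 0)},
}

# Zero-sum: the rewards in every cell sum to 0.
assert all(sum(payoff[r][c]) == 0 for r in ACTIONS for c in ACTIONS)
```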
1.3 - Multi-Agent Framework
Each agent can have its own utility function
Agents select actions autonomously
Agents can have different information
The outcome can depend on the actions of all the agents
Each agent's value depends on the outcome
2.0 - Minimax in Zero-Sum Games of Perfect Information
If agents act sequentially and can observe the state before acting → Perfect Information Games
Can do dynamic programming or search: Each agent maximises for itself
When we apply this to two-person zero-sum games (pure conflict), we use minimax search (very similar to AND-OR trees)
Most real games are too big to carry out minimax search
When we apply this to general-sum games, we can use "subgame perfect equilibrium solutions"
Topics beyond this course:
Multi-Agent MDPs: A value function for each agent - Each agent maximises its own value function
Multi-Agent Reinforcement Learning: Each agent has its own Q-function, rewards coupled through the environment
2.1 - Zero-Sum Games: Adversarial World
Making good decisions in zero-sum games requires respecting your opponents
Take into account what your opponents will do
Assume your opponents are rational and smart, (at least until proven otherwise)
Two-Player Zero-Sum Games
Turn-Taking → Perfect Information
Deterministic → The game is deterministic but the agent does not know what the opponent will do, and hence to the agent, the environment is non-deterministic
Performing an action is deterministic, but other agents' moves are not necessarily predictable by the agent - the non-determinism here is logical (possibilities rather than probabilities)
Fully observable
Zero-sum → The total gain from the winning participants minus the total losses from the losing participants is zero
Essentially, one's winning means the opponent's loss (e.g. Tic-Tac-Toe, Chess, Go)
2.2 - The Zero-Sum Perfect Information Game Environment
State Space
Action Space
Initial State - initial hand of cards, initial configuration of a chessboard
World Dynamics: Represents the outcome of the agent's move, followed by the possible game state(s) after the opponent moves
Observation & Observation Function: Not needed as the state is fully observable with perfect information
Utility: +1 win, -1 lose
Similar to an AND-OR tree
The OR-Level represents the agent's move - in search we maximise this value
The AND-Level represents the opponent's move - in search we minimise this value
In Zero-Sum games, we assume that a move that is beneficial for another agent is proportionally bad for us.
We essentially calculate the best move for our opponent; worst move for us.
In general, the branching factor and the depth of terminal states are large.
For example for chess:
Number of states ∼ 10^40
Branching factor ∼35
Number of total moves in a game ∼100
2.4 - Online Search for Games?
🧠 We try to find intermediate states in the game that 'look good'?
Calculate the solution for the current state.
Perform a search with the current state as the root of the tree
Perform the first action of the solution
Recompute the solution from the new state, utilising the previous sub-trees if possible.
Could we use MCTS for this search problem?
AI systems that play Go and Chess use a version of Adversarialised MCTS
2.4.1 - Choosing an Action: Basic Idea of Minimax Algorithm
Using the current state as the initial state, build the game tree to the maximal depth h (called decision horizon) feasible within a computational time limit
Evaluate the states of the leaf nodes
Use a heuristic as an evaluation function to estimate how favourable a state is
Back-up the results from the leaves to the root
At each non-leaf node N, the backed-up value is the value of the best state that MAX can reach at depth h if MIN plays well (by the same criterion as MAX applies to itself; MIN plays rationally)
Same criterion = same evaluation function
!! Assumptions within the MINIMAX Algorithm !!
Select the move toward a MIN node that has the largest backed-up value (from the leaf nodes)
2.5 - Evaluation Function
We need a heuristic to estimate how favourable a game state is for the agent (MAX)
Usually called the evaluation function e : S → ℝ
e(s) > 0: s is favourable to MAX - the larger, the better
e(s) < 0: s is favourable to MIN
e(s) = 0: s is neutral
Designing a good evaluation function is difficult.
The evaluation is usually a weighted sum of the features
A linear function of the features is of the form (sketched in code after the example features below):
e_w̄(s) = ∑_{i=1}^{n} w_i f_i(s)
Where w̄ = (w_1, ⋯, w_n) are the weights and each f_i(s) is a feature.
For example, features in Chess may include:
Number of pieces of each type
Number of possible moves
Number of squares controlled.
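Putting the linear form and the example features together, a rough sketch (the feature functions and weights here are hypothetical placeholders, not values from the course):

```python
from typing import Callable, Sequence

def make_linear_evaluator(weights: Sequence[float],
                          features: Sequence[Callable[[object], float]]):
    """Return e(s) = sum_i w_i * f_i(s) for the given weights and feature functions."""
    def evaluate(state) -> float:
        return sum(w * f(state) for w, f in zip(weights, features))
    return evaluate

# Hypothetical chess-style usage: material balance, mobility, squares controlled.
# e = make_linear_evaluator([1.0, 0.1, 0.05], [material, mobility, control])
```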
2.5.1 - Evaluation Function - Tic-Tac-Toe
e(s) = (# of rows, cols and diagonals where MAX can still win) − (# of rows, cols and diagonals where MIN can still win) (see the code sketch below)
Suppose the agent is represented by the crosses. Then, the evaluation function for the following states are:
Suppose at the first layer we only consider moves that are distinct up to symmetry - in this case there are only 3 possible moves (corner, edge, centre).
From this, generate all possible moves for the opponent and back-up the result from the leaf nodes.
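A minimal sketch of this Tic-Tac-Toe evaluation function, assuming the board is a 3×3 list of lists containing "X" (MAX), "O" (MIN) or None; these representation choices are mine, not from the notes.

```python
# All 8 winning lines: 3 rows, 3 columns, 2 diagonals.
LINES = (
    [[(r, c) for c in range(3)] for r in range(3)]
    + [[(r, c) for r in range(3)] for c in range(3)]
    + [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]
)

def evaluate(board):
    """e(s) = (# lines where MAX can still win) - (# lines where MIN can still win).

    A line is still winnable for a player if it contains no opponent pieces.
    """
    open_for_max = sum(all(board[r][c] != "O" for r, c in line) for line in LINES)
    open_for_min = sum(all(board[r][c] != "X" for r, c in line) for line in LINES)
    return open_for_max - open_for_min
```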
2.6 - Minimax Algorithm
Expand the game tree from the current state (where it is MAX's turn to play) to the depth h
Compute the evaluation function at every leaf of the tree
Back up the results from the leaves to the root node and pick the best action assuming the worst from MIN
A MAX node gets the maximum evaluation of its successors
A MIN node gets the minimum evaluation of its successors
Select the move toward a MIN node that has the largest backed-up value.
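A minimal sketch of the procedure above; the successors, is_terminal and evaluate helpers are assumed placeholders (successors is taken to yield (action, next_state) pairs), not a prescribed API.

```python
def minimax(state, depth, maximising, successors, is_terminal, evaluate):
    """Depth-limited minimax: back up leaf evaluations to the root."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    values = [
        minimax(child, depth - 1, not maximising, successors, is_terminal, evaluate)
        for _action, child in successors(state)
    ]
    # A MAX node takes the maximum of its successors; a MIN node takes the minimum.
    return max(values) if maximising else min(values)

def best_move(state, depth, successors, is_terminal, evaluate):
    """Select the move toward the MIN node with the largest backed-up value."""
    action, _child = max(
        successors(state),
        key=lambda ac: minimax(ac[1], depth - 1, False, successors, is_terminal, evaluate),
    )
    return action
```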
2.6.1 - Minimax for Game Playing
Repeat until a terminal state is reached
Select move using Minimax
Execute move
Observe MIN's move
Note that at each cycle the game tree is re-expanded from the new current state, reusing previous sub-trees where possible.
2.7 - Alpha-Beta Pruning
🧠 A pruning technique that works well in practice, but doesn't provide any guarantee of better worst-case performance; the value it returns is identical to plain minimax.
Maintain an upper and lower bound of the evaluation function at each node
α: Best already explored option along the path to the root for the maximiser
β: Best already explored option along the path to the root for the minimiser
Explore the game tree to depth h in depth-first manner
Back up α and β whenever possible
Prune branches that cannot change the final decision
Extremely beneficial for game trees with greater depth - save a lot of computation by early pruning.
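The same depth-limited search with alpha-beta pruning added, under the same assumed helpers as the minimax sketch above; it returns the same value as plain minimax but can skip branches that cannot affect the decision.

```python
def alphabeta(state, depth, alpha, beta, maximising,
              successors, is_terminal, evaluate):
    """Minimax value of `state` with alpha-beta pruning.

    Call as alphabeta(root, h, float("-inf"), float("inf"), True, ...).
    """
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    if maximising:
        value = float("-inf")
        for _action, child in successors(state):
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False,
                                         successors, is_terminal, evaluate))
            alpha = max(alpha, value)   # best explored option for MAX on this path
            if alpha >= beta:           # MIN already has a better alternative: prune
                break
        return value
    value = float("inf")
    for _action, child in successors(state):
        value = min(value, alphabeta(child, depth - 1, alpha, beta, True,
                                     successors, is_terminal, evaluate))
        beta = min(beta, value)         # best explored option for MIN on this path
        if alpha >= beta:               # MAX already has a better alternative: prune
            break
    return value
```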
The strategic form of a game, or normal game form, is a tuple ⟨I, {A_i, u_i}_{i∈I}⟩ consisting of:
A finite set I of agents {1,⋯,n}
And for each agent i∈I
A set of actions Ai
A pure strategy profile is a tuple a=(a1,⋯,an) in which agent i carries out action ai
[In games, we extend →] A mixed strategy profile is a tuple of distributions σ = (σ_1, …, σ_n) in which agent i draws an action from the lottery σ_i ∈ Δ(A_i), where Δ(A_i) is the simplex over A_i
Involve some component of randomness or probability.
If we're using a mixed strategy profile, we need to compute the expected outcome over all num_player_actions × num_opponent_actions combinations of actions
A utility function ui(σ) for strategy profile σ, which gives the expected utility for agent i when all agents follow their component of σ
Under mild conditions on Ai and ui, there always exists a mixed strategy profile in which no agent benefits from adjusting their strategy - this is the Nash equilibrium solution.
If your opponent chooses actions uniformly at random (probability 1/3 each), every pure action has expected payoff 0, so choosing actions with the same uniform distribution is a best response
If your opponent chooses actions at random, with probability 1/2 for paper and probability 1/4 each for rock and scissors, your best response is to always choose scissors, since it beats the over-weighted paper (see the sketch at the end of this section)
Pure Actions → {Rock, Paper, Scissors}
Pure Strategy Profile → Determines the outcome of the game
If player A plays "Scissors" and player B plays "Rock", then the outcome is fully determined: player B wins, since Rock beats Scissors
Simplex → Probability distribution over all items, where total probability sums to 1
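A small sketch that checks the best-response claims above by computing the expected payoff of each pure action against a given opponent mixture; standard rock-paper-scissors payoffs of +1 win, −1 loss, 0 draw are assumed.

```python
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}

def payoff(mine, theirs):
    """+1 if we win, -1 if we lose, 0 on a draw."""
    if (mine, theirs) in BEATS:
        return 1
    if (theirs, mine) in BEATS:
        return -1
    return 0

def expected_payoffs(opponent_mix):
    """Expected payoff of each pure action against the opponent's mixed strategy."""
    return {a: sum(p * payoff(a, b) for b, p in opponent_mix.items()) for a in ACTIONS}

print(expected_payoffs({"rock": 1/3, "paper": 1/3, "scissors": 1/3}))
# -> all zero: any strategy (including uniform) is a best response.
print(expected_payoffs({"paper": 1/2, "rock": 1/4, "scissors": 1/4}))
# -> scissors has the highest expected payoff, so pure scissors is the best response.
```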
3.1 - Simultaneous Moves and Normal Form Games
Consider a penalty kick in soccer. The kicker can kick to his left or right. The goalkeeper can jump to their left or right.
The matrix shows the probability of scoring a goal.
Suppose the kicker decides to kick to the right with probability pk, and that the goalkeeper decides to jump to their right with probability pg.
The figure to the right shows the probability of a goal as a function of pk.
The different lines correspond to different values of pg∈{0.0,0.2,0.4,0.6,0.8,1.0}
There is something special about pk=0.4 - at this value, the probability of a goal is 0.48, independent of the value of pg
The equivalent plot for pg is similar, with all of the lines meeting at pg=0.3. Again when pg=0.3, the probability of a goal is 0.48
This point is the equilibrium point of both players' strategies.
4.0 - Strategic Dominance
A simple way to reason over which strategies to use in a game is by strategic dominance analysis.
σ−i represents the mixed strategy for all players in the game except for player i
σˉi is a particular strategy, and σi is every other strategy
An agent's strategy σ̄_i dominates another strategy σ_i if it is always a better choice, irrespective of how the other agents in the game play; that is, if the following holds ∀σ_{−i} ∈ Δ_{−i}, σ̄_i ≠ σ_i:
u_i(σ̄_i, σ_{−i}) > u_i(σ_i, σ_{−i})    (1)
then σ̄_i dominates σ_i
Similarly, a strategy σi is dominated if another strategy is always a better choice, independent of the other agents' actions
Both concepts above rely on strict inequalities, but they can be weakened to consider weakly-dominating or weakly-dominated strategies respectively using weak inequalities.
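For reference, the weak-dominance condition (a standard definition stated here for completeness, not copied from the slides) is: σ̄_i weakly dominates σ_i if u_i(σ̄_i, σ_{−i}) ≥ u_i(σ_i, σ_{−i}) for all σ_{−i} ∈ Δ_{−i}, with strict inequality for at least one σ_{−i}.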
4.1 - Iterative Elimination of Weakly-Dominated Strategies
🧠 Iterative Elimination of Weakly-Dominated Strategies can be applied to any game
This algorithmic procedure starts by choosing an agent, removing all weakly-dominated strategies from its action space
The idea is that a weakly dominated strategy will never be played, so it should not feature in any of the agents' strategic reasoning.
The elimination step is repeated for the next agent, and so on; and then the entire process is repeated for all agents until no further strategies can be eliminated.
At the end, the remaining joint strategy space gives a reduced form of the game which can be much easier to analyse or reason over, and in some cases, it can even return a unique solution to the game (although this is certainly not guaranteed in general).
This algorithm looks remarkably similar to arc-consistency.
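A rough sketch of this elimination procedure for a two-player game given as payoff dictionaries; note this version only checks domination by other pure strategies (a common simplification, since full weak dominance also allows mixed dominators), and the dictionary layout is an illustrative choice.

```python
def weakly_dominated(payoff, a, my_actions, opp_actions):
    """True if pure action `a` is weakly dominated by some other pure action `b`.

    payoff[(my_action, opp_action)] is this player's utility for the joint action.
    """
    for b in my_actions:
        if b == a:
            continue
        never_worse = all(payoff[(b, o)] >= payoff[(a, o)] for o in opp_actions)
        sometimes_better = any(payoff[(b, o)] > payoff[(a, o)] for o in opp_actions)
        if never_worse and sometimes_better:
            return True
    return False

def iterated_elimination(actions1, actions2, payoff1, payoff2):
    """Alternately remove weakly-dominated strategies until nothing changes.

    payoff2 is indexed as (player 2's action, player 1's action).
    """
    a1, a2 = list(actions1), list(actions2)
    changed = True
    while changed:
        changed = False
        for a in list(a1):
            if len(a1) > 1 and weakly_dominated(payoff1, a, a1, a2):
                a1.remove(a)
                changed = True
        for a in list(a2):
            if len(a2) > 1 and weakly_dominated(payoff2, a, a2, a1):
                a2.remove(a)
                changed = True
    return a1, a2
```

For weakly-dominated strategies the order of elimination can matter, which is one reason the reduced game is not guaranteed to be unique.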
4.2 - Dominant Strategy Equilibrium
🧠 This occurs when each agent has a dominant strategy - so each agent plays a dominant strategy.
A second approach is to find dominant strategies, and look for a dominant strategy equilibrium (DSE)
A DSE is a strategy profile in which each agent's strategy weakly dominates all other strategies:
u_i(σ_i^DSE, σ_{−i}) ≥ u_i(σ_i, σ_{−i})    ∀σ_i ≠ σ_i^DSE, ∀i ∈ I    (3)
That is, an agent's strategy is at least as good as all other strategies, irrespective of the strategies of all other players.
DSE is a strong notion of equilibrium and is not guaranteed to exist; it appears in only limited contexts.
5.0 - Nash Equilibrium
The notion of a Nash Equilibrium relies on a concept of an agent's best response to the other strategies of agents (rather than dominant or dominated strategies)
A best response for an agent i to the strategies σ−i of the other agents is a strategy that has maximum utility for that agent (it is possible that there are several pure actions that have the same utility)
That is, σ_i is a best response to σ_{−i} if, for all other strategies σ_i′ for agent i:
u_i(σ_i, σ_{−i}) ≥ u_i(σ_i′, σ_{−i})
Since the inequality is greater than or equal to, we could say that this is the weak best response.
By convention, this is the way to define the best response.
A strategy profile σ∗ is a Nash Equilibrium strategy if σi∗ is a best response to σ−i∗ for every agent i
u_i(σ^∗) − u_i(σ_i, σ^∗_{−i}) ≥ 0    ∀σ_i ∈ Δ(A_i), ∀i ∈ I
That is, a Nash equilibrium is a strategy profile such that no agent can do better by unilaterally deviating from that profile.
5.1 - Nash Equilibrium - Key Results
One of the great results of game theory (proven by Nash in 1950) is that every (normal form) finite game has at least one Nash equilibrium
A second great result (in the same paper) concerns symmetric games - games in which all players have the same action set and payoffs depend only on the strategies played, not on which player plays them. Every finite symmetric game has a symmetric Nash equilibrium, in which every player uses the same mixed strategy, as in scissors, paper, rock
We can have pure-strategy Nash equilibria, where the strategies aren't randomised.
A zero sum game is a game in which payoffs for all players in each outcome sum to zero. Another useful result of game theory is that in every finite two-player zero-sum game, every Nash equilibrium is equivalent to a mixed-strategy minimax outcome.
From (2) and (3), in any equilibrium of a two-player, symmetric, zero-sum game each player must receive a payoff of 0, and such games always have equilibria in symmetric strategies
Note: Any constant-sum game can be normalised to make it equivalent to a zero-sum game.
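As a quick illustration of that normalisation: if every outcome a of a two-player game satisfies u_1(a) + u_2(a) = c for some constant c, then the shifted payoffs u_i′(a) = u_i(a) − c/2 sum to 0 in every outcome, and shifting a player's payoffs by a constant does not change their preferences between strategies, so the normalised game is strategically equivalent.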
5.2 - Computing Nash Equilibrium
To compute a Nash equilibrium for a game in normal form, there are three steps
Eliminate dominated strategies
Determine which remaining actions should be in the support set
Support set - the set of actions that have non-zero probabilities
In practice, it's difficult to determine the support set
Determine the probability for the actions in the support set
However, in games with only two actions (or if iterative elimination of dominated strategies leaves only two actions), we don't have to think about it.
5.2.1 - Goalkeeper Example (Again)
If the goalkeeper dives right, the probability of a goal is
P(goal ∣ right)=0.9pk+0.2(1−pk)
If the goalkeeper dives left, the probability of a goal is
P(goal | left)=0.3pk+0.6(1−pk)
The only time the goalkeeper would randomise their actions is if these two quantities are equal
🧠 At what probability of pg is the kicker indifferent to kicking left or right? i.e. induces the kicker to randomise?
0.2pg + 0.6(1 − pg) = 0.9pg + 0.3(1 − pg), which gives pg = 0.3
Thus, the Nash Equilibrium is at (pk,pg)=(0.4,0.3)
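A small sketch that recovers this equilibrium numerically; the four scoring probabilities are taken from the two indifference equations above, and the table layout is my own choice.

```python
# P(goal) for each (kicker's direction, goalkeeper's direction) pair,
# read off the indifference equations above.
GOAL = {
    ("right", "right"): 0.9, ("left", "right"): 0.2,
    ("right", "left"): 0.3,  ("left", "left"): 0.6,
}

def indifference(a, b, c, d):
    """Solve p*a + (1-p)*b = p*c + (1-p)*d for the mixing probability p."""
    return (d - b) / (a - b - c + d)

# p_k makes the goalkeeper indifferent between diving right and diving left.
p_k = indifference(GOAL[("right", "right")], GOAL[("left", "right")],
                   GOAL[("right", "left")], GOAL[("left", "left")])
# p_g makes the kicker indifferent between kicking right and kicking left.
p_g = indifference(GOAL[("right", "right")], GOAL[("right", "left")],
                   GOAL[("left", "right")], GOAL[("left", "left")])

print(p_k, p_g)  # approximately 0.4 and 0.3
# Probability of a goal at the equilibrium (the same against either goalkeeper action):
print(GOAL[("right", "right")] * p_k + GOAL[("left", "right")] * (1 - p_k))  # ~0.48
```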
5.3 - Morra
🧠 In the game of Morra, each player shows either one or two fingers and announces a number between 2 and 4. If a player’s number is equal to the sum of the number of fingers shown, then her opponent must pay her that many dollars. The payoff is the net transfer, so that both players earn zero if both or neither guess the correct number of fingers shown. In this game, each player has 6 strategies:
They may show one finger and guess 2;
They may show one finger and guess 3;
They may show one finger and guess 4;
They may show two fingers and guess one of the three numbers (x3)
There are two weakly dominated strategies in Morra - what are they?
It never pays to put out one finger and guess that the total number of fingers will be 4, because the other player cannot put out more than two fingers.
Likewise, it never pays to put out two fingers and guess that the sum will be 2, because the other player must put down at least one finger.
This leads to the following reduced payoff matrix for a player
Imagine that player A can read player B's mind and guess how he plays before he makes his move. What pure strategy should player B use?
Player B consults a textbook and decides to use randomisation to improve his performance in Morra. Ideally, if he can find the best mixed strategy to play, what would be his expected payoff?
This is a trick question - it depends on what his opponent does! Will his opponent adapt?
What happens if Player B knows that Player A favours one strategy over another?
What happens if Player B knows that Player A randomises uniformly over all strategies?
If Player A can still read his mind to see his mixed strategy, how should Player B approach this?
Since this is a symmetric two-player zero-sum game we know that an equilibrium must be symmetric. What does this imply about the equilibrium rewards?
One possible mixed strategy is to play show one finger and call “three” with probability 0.6, and to show two fingers and call “three” with probability 0.4 (and play the other strategies with probability 0). Is this a Nash equilibrium strategy? Assume that Player B is risk neutral with respect to the game payoffs
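One way to check this claim is to rebuild the reduced payoff matrix from the rules stated above and test the best-response condition against the proposed mixture; the sketch below does that, where a strategy is written as (fingers shown, number guessed).

```python
# Reduced strategy set after removing the two weakly dominated strategies
# (show 1 & guess 4, show 2 & guess 2).
STRATS = [(1, 2), (1, 3), (2, 3), (2, 4)]

def payoff(mine, theirs):
    """Net dollars won by `mine` = (fingers, guess) against `theirs`."""
    total = mine[0] + theirs[0]
    i_win, they_win = mine[1] == total, theirs[1] == total
    if i_win and not they_win:
        return total
    if they_win and not i_win:
        return -total
    return 0  # both or neither guessed correctly

# Proposed mixture: show one finger & guess three w.p. 0.6,
# show two fingers & guess three w.p. 0.4.
mix = {(1, 3): 0.6, (2, 3): 0.4}

# Expected payoff of every pure counter-strategy against the mixture.
counter = {s: sum(p * payoff(s, opp) for opp, p in mix.items()) for s in STRATS}
print(counter)
# The mixture earns 0 against itself, so it is a Nash equilibrium strategy
# iff no pure strategy earns more than 0 against it (the symmetric zero-sum
# equilibrium value from Section 5.1). If this prints True, the check passes.
print(max(counter.values()) <= 0)
```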