To reason about other agents, we need some model of what the other agent(s) are thinking and how they are likely to act
0.1 - Strategic Uncertainty
There are two key settings that give rise to strategic uncertainty:
Simultaneous move / imperfect information - unaware of what your opponent is going to do - e.g. scissors paper rock
Perfect Information - You can see the entire environment and what moves your opponent has made up to that point, so you can reason about what they're going to do next.
1.0 - Single Agents vs Multiple Agents
Many domains are characterised by multiple agents rather than a single agent - these are multi-agent systems.
A multi-agent system is a computational system consisting of two or more interacting intelligent agents
Multi-agent systems are used to model and solve complex problems that are difficult or impossible for an individual agent or monolithic system to solve
Agents can be cooperative, competitive or somewhere in-between
Agents that are strategic can't be modelled as nature.
Trading on stock market → Build probabilistic model to model the stock market.
Bidding at an auction → Model as adversarial agents - "You can see what they're doing"
1.2 - Game Theory
Game theory is an umbrella term for the study of mathematical models of strategic interaction among rational agents
The name "game" theory is traced to its origin analysing zero-sum games such as chess and card games, in which each participant's gains or losses are exactly balanced to those of other participants
It is now used to model all manner of scenarios, from cooperation and coordination, to conflict and deceit.
🧠 Key Idea: Game Theory provides a model for reasoning over other agents' actions and behaviours in AI, and is considered a foundation of multi-agent system design and analysis.
1.2.1 - Game Theory: Normal v Extensive Form
A key distinction in game theory regards the timing of when information about one agent's actions becomes available to other agents in the environment
Extensive Form: Agents / players take turns to act
The game is modelled by a game graph where game states are nodes, and each action is represented by an edge.
Perfect Information: The agent is able to see the other agents' moves, akin to fully-observable systems where you can see every action that a player has taken up to that point (e.g. Tic-Tac-Toe, Chess, Go)
Imperfect Information: Like partially observable systems, the agent cannot fully see the other agents' moves (e.g. card games, Battleships)
Normal Form: Simultaneous moves; players cannot observe each other's action choice:
Typically represented by a payoff matrix
E.g. rock-paper-scissors
1.2.2 - Sharing Game with Perfect Information
🧠 Extensive form of a sharing game with perfect information
Players take turns to move
Actions are represented by edges, payoffs are indicated by tuples at the leaf nodes (Andy's Reward, Barb's Reward)
Andy observes the initial state of the game and chooses to either keep, share or give the reward.
Barb chooses whether the action will be performed.
1.2.3 - Sharing Game with Imperfect Information
We introduce an ellipse around nodes, which we call an information set.
The second player cannot distinguish between the nodes in an information set.
Note that this is a zero-sum game, where the rewards in each outcome sum to 0.
We could model the game more concisely using a payoff matrix as shown below
By convention, the tuple of rewards is given in the order (row_player, col_player)
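Since the matrix itself comes from the lecture figure, here is a rough sketch of how such a payoff matrix can be written down in code, using rock-paper-scissors (a zero-sum game mentioned earlier) as a stand-in example; the dictionary layout is an illustrative choice, not a prescribed representation.

```python
# Illustrative payoff matrix for rock-paper-scissors (not the lecture's figure).
# Rewards are tuples in the order (row_player, col_player).
ACTIONS = ["rock", "paper", "scissors"]

payoff = {
    "rock":     {"rock": (0, 0),  "paper": (-1, 1), "scissors": (1, -1)},
    "paper":    {"rock": (1, -1), "paper": (0, 0),  "scissors": (-1, 1)},
    "scissors": {"rock": (-1, 1), "paper": (1, -1), "scissors": (0, 0)},
}

# Zero-sum: the rewards in every cell sum to 0.
assert all(sum(payoff[r][c]) == 0 for r in ACTIONS for c in ACTIONS)
```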
1.3 - Multi-Agent Framework
Each agent can have its own utility function
Agents select actions autonomously
Agents can have different information
The outcome can depend on the actions of all the agents
Each agent's value depends on the outcome
2.0 - Minimax in Zero-Sum Games of Perfect Information
If agents act sequentially and can observe the state before acting → Perfect Information Games
Can do dynamic programming or search: Each agent maximises for itself
When we apply this to two-person zero-sum games (pure conflict), we use minimax search (very similar to AND-OR trees)
Most real games are too big to carry out minimax search
When we apply this to general-sum games, we can use "subgame perfect equilibrium solutions"
Topics beyond this course:
Multi-Agent MDPs: A value function for each agent - Each agent maximises its own value function
Multi-Agent Reinforcement Learning: Each agent has its own Q-function, rewards coupled through the environment
2.1 - Zero-Sum Games: Adversarial World
Making good decisions in zero-sum games requires respecting your opponents
Take into account what your opponents will do
Assume your opponents are rational and smart, (at least until proven otherwise)
Two-Player Zero-Sum Games
Turn-Taking → Perfect Information
Deterministic → The game is deterministic but the agent does not know what the opponent will do, and hence to the agent, the environment is non-deterministic
Performing an action is deterministic, but other agents' moves are not necessarily predictable by the agent - the non-determinism here is logical (possibilities rather than probabilities)
Fully observable
Zero-sum → The total gain from the winning participants minus the total losses from the losing participants is zero
Essentially, one's winning means the opponent's loss (e.g. Tic-Tac-Toe, Chess, Go)
2.2 - The Zero-Sum Perfect Information Game Environment
State Space
Action Space
Initial State - initial hand of cards, initial configuration of a chessboard
World Dynamics: Represents the outcome of the agent's move, followed by the possible game state(s) after the opponent moves
Observation & Observation Function: Not needed as the state is fully observable with perfect information
Utility: +1 win, -1 lose
Similar to an AND-OR tree
The OR-Level represents the agent's move - in search we maximise this value
The AND-Level represents the opponent's move - in search we minimise this value
In Zero-Sum games, we assume that a move that is beneficial for another agent is proportionally bad for us.
We essentially calculate the best move for our opponent; worst move for us.
In general, the branching factor and the depth of terminal states are large.
For example for chess:
Number of states ∼ 10^40
Branching factor ∼35
Number of total moves in a game ∼100
2.4 - Online Search for Games?
🧠 We try to find intermediate states in the game that 'look good'?
Calculate the solution for the current state.
Perform a search with the current state as the root of the tree
Perform the first action of the solution
Recompute the solution from the new state, utilising the previous sub-trees if possible.
Could we use MCTS for this search problem?
AI systems that play Go and Chess use a version of Adversarialised MCTS
2.4.1 - Choosing an Action: Basic Idea of Minimax Algorithm
Using the current state as the initial state, build the game tree to the maximal depth h (called decision horizon) feasible within a computational time limit
Evaluate the states of the leaf nodes
Use a heuristic as an evaluation function to estimate how favourable a state is
Back-up the results from the leaves to the root
At each non-leaf node N, the backed-up value is the value of the best state that MAX can reach at depth h if MIN plays well (by the same criterion as MAX applies to itself; MIN plays rationally)
Same criterion = same evaluation function
!! Assumptions within the MINIMAX Algorithm !!
Select the move toward a MIN node that has the largest backed-up value (from the leaf nodes)
2.5 - Evaluation Function
We need a heuristic to estimate how favourable a game state is for the agent (MAX)
Usually called the evaluation function e : S → ℝ
e(s) > 0: s is favourable to MAX - the larger, the better
e(s) < 0: s is favourable to MIN
e(s) = 0: s is neutral
Designing a good evaluation function is difficult.
The evaluation is usually a weighted sum of the features
A linear function of the features is of the form (sketched in code after the example features below):
e_w̄(s) = ∑_{i=1}^{n} w_i f_i(s)
Where w̄ = (w_1, ⋯, w_n) are the weights and each f_i(s) is a feature.
For example, features in Chess may include:
Number of pieces of each type
Number of possible moves
Number of squares controlled.
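Putting the linear form and the example features together, a rough sketch (the feature functions and weights here are hypothetical placeholders, not values from the course):

```python
from typing import Callable, Sequence

def make_linear_evaluator(weights: Sequence[float],
                          features: Sequence[Callable[[object], float]]):
    """Return e(s) = sum_i w_i * f_i(s) for the given weights and feature functions."""
    def evaluate(state) -> float:
        return sum(w * f(state) for w, f in zip(weights, features))
    return evaluate

# Hypothetical chess-style usage: material balance, mobility, squares controlled.
# e = make_linear_evaluator([1.0, 0.1, 0.05], [material, mobility, control])
```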
2.5.1 - Evaluation Function - Tic-Tac-Toe
e(s) = (# of rows, cols and diagonals where MAX can still win) − (# of rows, cols and diagonals where MIN can still win) (see the code sketch below)
Suppose the agent is represented by the crosses. Then, the evaluation function for the following states are:
Suppose at the first layer we only consider moves that are distinct up to symmetry - in this case there are only 3 possible moves (corner, edge, centre).
From this, generate all possible moves for the opponent and back-up the result from the leaf nodes.
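A minimal sketch of this Tic-Tac-Toe evaluation function, assuming the board is a 3×3 list of lists containing "X" (MAX), "O" (MIN) or None; these representation choices are mine, not from the notes.

```python
# All 8 winning lines: 3 rows, 3 columns, 2 diagonals.
LINES = (
    [[(r, c) for c in range(3)] for r in range(3)]
    + [[(r, c) for r in range(3)] for c in range(3)]
    + [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]
)

def evaluate(board):
    """e(s) = (# lines where MAX can still win) - (# lines where MIN can still win).

    A line is still winnable for a player if it contains no opponent pieces.
    """
    open_for_max = sum(all(board[r][c] != "O" for r, c in line) for line in LINES)
    open_for_min = sum(all(board[r][c] != "X" for r, c in line) for line in LINES)
    return open_for_max - open_for_min
```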
2.6 - Minimax Algorithm
Expand the game tree from the current state (where it is MAX's turn to play) to the depth h
Compute the evaluation function at every leaf of the tree
Back up the results from the leaves to the root node and pick the best action assuming the worst from MIN
A MAX node gets the maximum evaluation of its successors
A MIN node gets the minimum evaluation of its successors
Select the move toward a MIN node that has the largest backed-up value.
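A minimal sketch of the procedure above; the successors, is_terminal and evaluate helpers are assumed placeholders (successors is taken to yield (action, next_state) pairs), not a prescribed API.

```python
def minimax(state, depth, maximising, successors, is_terminal, evaluate):
    """Depth-limited minimax: back up leaf evaluations to the root."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    values = [
        minimax(child, depth - 1, not maximising, successors, is_terminal, evaluate)
        for _action, child in successors(state)
    ]
    # A MAX node takes the maximum of its successors; a MIN node takes the minimum.
    return max(values) if maximising else min(values)

def best_move(state, depth, successors, is_terminal, evaluate):
    """Select the move toward the MIN node with the largest backed-up value."""
    action, _child = max(
        successors(state),
        key=lambda ac: minimax(ac[1], depth - 1, False, successors, is_terminal, evaluate),
    )
    return action
```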
2.6.1 - Minimax for Game Playing
Repeat until a terminal state is reached
Select move using Minimax
Execute move
Observe MIN's move
Note that at each cycle the game tree is re-expanded from the new current state, reusing previous sub-trees where possible.
2.7 - Alpha-Beta Pruning
🧠 A pruning technique that works well in practice, but doesn't provide any guarantee of better worst-case performance; the value it returns is identical to plain minimax.
Maintain an upper and lower bound of the evaluation function at each node
α: Best already explored option along the path to the root for the maximiser
β: Best already explored option along the path to the root for the minimiser
Explore the game tree to depth h in depth-first manner
Back up α and β whenever possible
Prune branches that cannot change the final decision
Extremely beneficial for game trees with greater depth - save a lot of computation by early pruning.
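The same depth-limited search with alpha-beta pruning added, under the same assumed helpers as the minimax sketch above; it returns the same value as plain minimax but can skip branches that cannot affect the decision.

```python
def alphabeta(state, depth, alpha, beta, maximising,
              successors, is_terminal, evaluate):
    """Minimax value of `state` with alpha-beta pruning.

    Call as alphabeta(root, h, float("-inf"), float("inf"), True, ...).
    """
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    if maximising:
        value = float("-inf")
        for _action, child in successors(state):
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False,
                                         successors, is_terminal, evaluate))
            alpha = max(alpha, value)   # best explored option for MAX on this path
            if alpha >= beta:           # MIN already has a better alternative: prune
                break
        return value
    value = float("inf")
    for _action, child in successors(state):
        value = min(value, alphabeta(child, depth - 1, alpha, beta, True,
                                     successors, is_terminal, evaluate))
        beta = min(beta, value)         # best explored option for MIN on this path
        if alpha >= beta:               # MAX already has a better alternative: prune
            break
    return value
```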
The strategic form of a game, or normal game form, is a tuple ⟨I, {A_i, u_i}_{i∈I}⟩ consisting of:
A finite set I of agents {1,⋯,n}
And for each agent i∈I
A set of actions Ai
A pure strategy profile is a tuple a=(a1,⋯,an) in which agent i carries out action ai
[In games, we extend →] A mixed strategy profile is a tuple of distributions σ = (σ_1, …, σ_n) in which agent i draws an action from the lottery σ_i ∈ Δ(A_i), where Δ(A_i) is the simplex over A_i
Involve some component of randomness or probability.
If we're using a mixed strategy profile, we need to compute the expected outcome over all num_player_actions × num_opponent_actions combinations of actions
A utility function ui(σ) for strategy profile σ, which gives the expected utility for agent i when all agents follow their component of σ
Under mild conditions on Ai and ui, there always exists a mixed strategy profile in which no agent benefits from adjusting their strategy - this is the Nash equilibrium solution.
If your opponent chooses actions uniformly at random (probability 1/3 each), every pure action has expected payoff 0, so choosing actions with the same uniform distribution is a best response
If your opponent chooses actions at random, with probability 1/2 for paper and probability 1/4 each for rock and scissors, your best response is to always choose scissors, since it beats the over-weighted paper (see the sketch at the end of this section)
Pure Actions → {Rock, Paper, Scissors}
Pure Strategy Profile → Determines the outcome of the game
If player A plays "Scissors" and player B plays "Rock", then the outcome is fully determined: player B wins, since Rock beats Scissors
Simplex → Probability distribution over all items, where total probability sums to 1
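A small sketch that checks the best-response claims above by computing the expected payoff of each pure action against a given opponent mixture; standard rock-paper-scissors payoffs of +1 win, −1 loss, 0 draw are assumed.

```python
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}

def payoff(mine, theirs):
    """+1 if we win, -1 if we lose, 0 on a draw."""
    if (mine, theirs) in BEATS:
        return 1
    if (theirs, mine) in BEATS:
        return -1
    return 0

def expected_payoffs(opponent_mix):
    """Expected payoff of each pure action against the opponent's mixed strategy."""
    return {a: sum(p * payoff(a, b) for b, p in opponent_mix.items()) for a in ACTIONS}

print(expected_payoffs({"rock": 1/3, "paper": 1/3, "scissors": 1/3}))
# -> all zero: any strategy (including uniform) is a best response.
print(expected_payoffs({"paper": 1/2, "rock": 1/4, "scissors": 1/4}))
# -> scissors has the highest expected payoff, so pure scissors is the best response.
```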
3.1 - Simultaneous Moves and Normal Form Games
Consider a penalty kick in soccer. The kicker can kick to his left or right. The goalkeeper can jump to their left or right.
The matrix shows the probability of scoring a goal.
Suppose the kicker decides to kick to the right with probability pk, and that the goalkeeper decides to jump to their right with probability pg.
The figure to the right shows the probability of a goal as a function of pk.
The different lines correspond to different values of pg∈{0.0,0.2,0.4,0.6,0.8,1.0}
There is something special about pk=0.4 - at this value, the probability of a goal is 0.48, independent of the value of pg
The equivalent plot for pg is similar, with all of the lines meeting at pg=0.3. Again when pg=0.3, the probability of a goal is 0.48
This point is the equilibrium point of both players' strategies.
4.0 - Strategic Dominance
A simple way to reason over which strategies to use in a game is by strategic dominance analysis.
σ−i represents the mixed strategy for all players in the game except for player i
σˉi is a particular strategy, and σi is every other strategy
An agent's strategy σ̄_i dominates another strategy σ_i if it is always a better choice, irrespective of how the other agents in the game play; that is, if the following holds ∀σ_{−i} ∈ Δ_{−i}, σ̄_i ≠ σ_i:
u_i(σ̄_i, σ_{−i}) > u_i(σ_i, σ_{−i})    (1)
then σ̄_i dominates σ_i
Similarly, a strategy σi is dominated if another strategy is always a better choice, independent of the other agents' actions
Both concepts above rely on strict inequalities, but they can be weakened to consider weakly-dominating or weakly-dominated strategies respectively using weak inequalities.
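For reference, the weak-dominance condition (a standard definition stated here for completeness, not copied from the slides) is: σ̄_i weakly dominates σ_i if u_i(σ̄_i, σ_{−i}) ≥ u_i(σ_i, σ_{−i}) for all σ_{−i} ∈ Δ_{−i}, with strict inequality for at least one σ_{−i}.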
4.1 - Iterative Elimination of Weakly-Dominated Strategies
🧠 Iterative Elimination of Weakly-Dominated Strategies can be applied to any game
This algorithmic procedure starts by choosing an agent, removing all weakly-dominated strategies from its action space
The idea is that a weakly dominated strategy will never be played, so it should not feature in any of the agents' strategic reasoning.
The elimination step is repeated for the next agent, and so on; and then the entire process is repeated for all agents until no further strategies can be eliminated.
At the end, the remaining joint strategy space gives a reduced form of the game which can be much easier to analyse or reason over, and in some cases, it can even return a unique solution to the game (although this is certainly not guaranteed in general).
This algorithm looks remarkably similar to arc-consistency.
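A rough sketch of this elimination procedure for a two-player game given as payoff dictionaries; note this version only checks domination by other pure strategies (a common simplification, since full weak dominance also allows mixed dominators), and the dictionary layout is an illustrative choice.

```python
def weakly_dominated(payoff, a, my_actions, opp_actions):
    """True if pure action `a` is weakly dominated by some other pure action `b`.

    payoff[(my_action, opp_action)] is this player's utility for the joint action.
    """
    for b in my_actions:
        if b == a:
            continue
        never_worse = all(payoff[(b, o)] >= payoff[(a, o)] for o in opp_actions)
        sometimes_better = any(payoff[(b, o)] > payoff[(a, o)] for o in opp_actions)
        if never_worse and sometimes_better:
            return True
    return False

def iterated_elimination(actions1, actions2, payoff1, payoff2):
    """Alternately remove weakly-dominated strategies until nothing changes.

    payoff2 is indexed as (player 2's action, player 1's action).
    """
    a1, a2 = list(actions1), list(actions2)
    changed = True
    while changed:
        changed = False
        for a in list(a1):
            if len(a1) > 1 and weakly_dominated(payoff1, a, a1, a2):
                a1.remove(a)
                changed = True
        for a in list(a2):
            if len(a2) > 1 and weakly_dominated(payoff2, a, a2, a1):
                a2.remove(a)
                changed = True
    return a1, a2
```

For weakly-dominated strategies the order of elimination can matter, which is one reason the reduced game is not guaranteed to be unique.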
4.2 - Dominant Strategy Equilibrium
🧠 This occurs when each agent has a dominant strategy - so each agent plays a dominant strategy.
A second approach is to find dominant strategies, and look for a dominant strategy equilibrium (DSE)
A DSE is a strategy profile in which each agent's strategy weakly dominates all other strategies:
u_i(σ_i^DSE, σ_{−i}) ≥ u_i(σ_i, σ_{−i})    ∀σ_i ≠ σ_i^DSE, ∀i ∈ I    (3)
That is, an agent's strategy is at least as good as all other strategies, irrespective of the strategies of all other players.
DSE is a strong notion of equilibrium and is not guaranteed to exist; it appears in only limited contexts.
5.0 - Nash Equilibrium
The notion of a Nash Equilibrium relies on a concept of an agent's best response to the other strategies of agents (rather than dominant or dominated strategies)
A best response for an agent i to the strategies σ−i of the other agents is a strategy that has maximum utility for that agent (it is possible that there are several pure actions that have the same utility)
That is, σ_i is a best response to σ_{−i} if, for all other strategies σ_i′ for agent i:
u_i(σ_i, σ_{−i}) ≥ u_i(σ_i′, σ_{−i})
Since the inequality is greater than or equal to, we could say that this is the weak best response.
By convention, this is the way to define the best response.
A strategy profile σ∗ is a Nash Equilibrium strategy if σi∗ is a best response to σ−i∗ for every agent i
u_i(σ^∗) − u_i(σ_i, σ^∗_{−i}) ≥ 0    ∀σ_i ∈ Δ(A_i), ∀i ∈ I
That is, a Nash equilibrium is a strategy profile such that no agent can do better by unilaterally deviating from that profile.
5.1 - Nash Equilibrium - Key Results
One of the great results of game theory (proven by Nash in 1950) is that every (normal form) finite game has at least one Nash equilibrium
A second great result (in the same paper) concerns symmetric games - games in which all players have the same action set and payoffs depend only on the strategies played, not on which player plays them. Every finite symmetric game has a symmetric Nash equilibrium, in which every player uses the same mixed strategy, as in scissors, paper, rock
We can have pure-strategy Nash equilibria, where the strategies aren't randomised.
A zero sum game is a game in which payoffs for all players in each outcome sum to zero. Another useful result of game theory is that in every finite two-player zero-sum game, every Nash equilibrium is equivalent to a mixed-strategy minimax outcome.
From (2) and (3), in any equilibrium of a two-player, symmetric, zero-sum game each player must receive a payoff of 0, and such games always have equilibria in symmetric strategies
Note: Any constant-sum game can be normalised to make it equivalent to a zero-sum game.
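As a quick illustration of that normalisation: if every outcome a of a two-player game satisfies u_1(a) + u_2(a) = c for some constant c, then the shifted payoffs u_i′(a) = u_i(a) − c/2 sum to 0 in every outcome, and shifting a player's payoffs by a constant does not change their preferences between strategies, so the normalised game is strategically equivalent.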
5.2 - Computing Nash Equilibrium
To compute a Nash equilibrium for a game in normal form, there are three steps
Eliminate dominated strategies
Determine which remaining actions should be in the support set
Support set - the set of actions that have non-zero probabilities
In practice, it's difficult to determine the support set
Determine the probability for the actions in the support set
However, in games with only two actions (or if iterative elimination of dominated strategies leaves only two actions), we don't have to think about it.
5.2.1 - Goalkeeper Example (Again)
If the goalkeeper dives right, the probability of a goal is
P(goal ∣ right)=0.9pk+0.2(1−pk)
If the goalkeeper dives left, the probability of a goal is
P(goal | left)=0.3pk+0.6(1−pk)
The only time the goalkeeper would randomise their actions is if these two quantities are equal
🧠 At what probability of pg is the kicker indifferent to kicking left or right? i.e. induces the kicker to randomise?
0.2pg + 0.6(1 − pg) = 0.9pg + 0.3(1 − pg), which gives pg = 0.3
Thus, the Nash Equilibrium is at (pk,pg)=(0.4,0.3)
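A small sketch that recovers this equilibrium numerically; the four scoring probabilities are taken from the two indifference equations above, and the table layout is my own choice.

```python
# P(goal) for each (kicker's direction, goalkeeper's direction) pair,
# read off the indifference equations above.
GOAL = {
    ("right", "right"): 0.9, ("left", "right"): 0.2,
    ("right", "left"): 0.3,  ("left", "left"): 0.6,
}

def indifference(a, b, c, d):
    """Solve p*a + (1-p)*b = p*c + (1-p)*d for the mixing probability p."""
    return (d - b) / (a - b - c + d)

# p_k makes the goalkeeper indifferent between diving right and diving left.
p_k = indifference(GOAL[("right", "right")], GOAL[("left", "right")],
                   GOAL[("right", "left")], GOAL[("left", "left")])
# p_g makes the kicker indifferent between kicking right and kicking left.
p_g = indifference(GOAL[("right", "right")], GOAL[("right", "left")],
                   GOAL[("left", "right")], GOAL[("left", "left")])

print(p_k, p_g)  # approximately 0.4 and 0.3
# Probability of a goal at the equilibrium (the same against either goalkeeper action):
print(GOAL[("right", "right")] * p_k + GOAL[("left", "right")] * (1 - p_k))  # ~0.48
```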
5.3 - Morra
🧠 In the game of Morra, each player shows either one or two fingers and announces a number between 2 and 4. If a player’s number is equal to the sum of the number of fingers shown, then her opponent must pay her that many dollars. The payoff is the net transfer, so that both players earn zero if both or neither guess the correct number of fingers shown. In this game, each player has 6 strategies:
They may show one finger and guess 2;
They may show one finger and guess 3;
They may show one finger and guess 4;
They may show two fingers and guess one of the three numbers (x3)
There are two weakly dominated strategies in Morra - what are they?
It never pays to put out one finger and guess that the total number of fingers will be 4, because the other player cannot put out more than two fingers.
Likewise, it never pays to put out two fingers and guess that the sum will be 2, because the other player must put down at least one finger.
This leads to the following reduced payoff matrix for a player
Imagine that player A can read player B's mind and guess how he plays before he makes his move. What pure strategy should player B use?
Player B consults a textbook and decides to use randomisation to improve his performance in Morra. Ideally, if he can find the best mixed strategy to play, what would be his expected payoff?
This is a trick question - it depends on what his opponent does! Will his opponent adapt?
What happens if Player B knows that Player A favours one strategy over another?
What happens if Player B knows that Player A randomises uniformly over all strategies?
If Player A can still read his mind to see his mixed strategy, how should Player B approach this?
Since this is a symmetric two-player zero-sum game we know that an equilibrium must be symmetric. What does this imply about the equilibrium rewards?
One possible mixed strategy is to play show one finger and call “three” with probability 0.6, and to show two fingers and call “three” with probability 0.4 (and play the other strategies with probability 0). Is this a Nash equilibrium strategy? Assume that Player B is risk neutral with respect to the game payoffs
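One way to check this claim is to rebuild the reduced payoff matrix from the rules stated above and test the best-response condition against the proposed mixture; the sketch below does that, where a strategy is written as (fingers shown, number guessed).

```python
# Reduced strategy set after removing the two weakly dominated strategies
# (show 1 & guess 4, show 2 & guess 2).
STRATS = [(1, 2), (1, 3), (2, 3), (2, 4)]

def payoff(mine, theirs):
    """Net dollars won by `mine` = (fingers, guess) against `theirs`."""
    total = mine[0] + theirs[0]
    i_win, they_win = mine[1] == total, theirs[1] == total
    if i_win and not they_win:
        return total
    if they_win and not i_win:
        return -total
    return 0  # both or neither guessed correctly

# Proposed mixture: show one finger & guess three w.p. 0.6,
# show two fingers & guess three w.p. 0.4.
mix = {(1, 3): 0.6, (2, 3): 0.4}

# Expected payoff of every pure counter-strategy against the mixture.
counter = {s: sum(p * payoff(s, opp) for opp, p in mix.items()) for s in STRATS}
print(counter)
# The mixture earns 0 against itself, so it is a Nash equilibrium strategy
# iff no pure strategy earns more than 0 against it (the symmetric zero-sum
# equilibrium value from Section 5.1). If this prints True, the check passes.
print(max(counter.values()) <= 0)
```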