Computer-based models learn from data, identify patterns and make decisions with minimal human intervention
Machines that Learn
Incredible success in many applications
Image recognition
Image segmentation (labelling pixels into different classes / groups)
Style transfer
AlphaGo
Breakout (Atari)
1.1 - Machine Learning Applications
Image recognition
Speech recognition
Traffic prediction
Recommender systems
Self-driving cars
Email spam filtering
Social media (Facebook, Twitter, etc) - recommender systems / advertising
Games: Chess, Atari, Go, DoTA, Poker
Medical/Health: Radiology image interpretation, predicting epidemic outbreaks etc.
Masked face recognition
2.0 - What is Deep Learning?
Artificial Intelligence is the umbrella term for anything that seems intelligent
Machine Learning - computer or machine learns from data
Deep Learning - A subset of Machine Learning that uses neural networks to learn feature representations directly from the data.
Neurons connected by edges
Weights and features
2.1 - Image Recognition
Formerly called Artificial Neural Networks (2012)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
1.2M images in 1K categories
Classification: Make 5 guesses about the image label
Huge jump in performance thanks to Deep learning in 2012.
Performance increases as the size of the neural net increases.
2.2 - Simple Neural Networks (Perceptron)
Weights determine the importance of each feature
The weighted sum of inputs (weights×inputs), passed through a non-linear function, gives the output
Can be mathematically represented as:
ŷ = g(θ₀ + ∑_{i=1}^{m} x_i θ_i)
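A minimal sketch of the perceptron computation above in Python/numpy; the sigmoid is used here as the non-linearity g, which is one common choice rather than the only one, and the weights are illustrative values.

```python
import numpy as np

def perceptron(x, theta, theta0):
    """Single perceptron: weighted sum of inputs plus bias, passed through
    a non-linear activation g (sigmoid here, one common choice)."""
    z = theta0 + np.dot(x, theta)      # theta0 + sum_i x_i * theta_i
    return 1.0 / (1.0 + np.exp(-z))    # g(z) = sigmoid(z)

# Example: 3 features, hand-picked weights (illustrative values only)
x = np.array([1.0, -2.0, 0.5])
theta = np.array([0.3, 0.1, -0.4])
print(perceptron(x, theta, theta0=0.2))
```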
2.3 - Building Blocks of Neural Networks
A fully connected neural network is a series of nodes where a node is connected to every single node in the next layer.
This is an example of a 3-layer NN (We don't count the input layer)
If we have a distinct weight for every neuron-to-neuron connection, the number of weights grows very quickly with the size of the network.
Weight sharing (reusing the same small set of weights across many connections, as in convolutional layers) reduces the total number of weights.
Can use pooling to downsample the input features, or outputs of each layer in the NN
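A small sketch (PyTorch, layer sizes chosen only for illustration) contrasting a fully connected layer, a weight-shared convolutional layer, and a pooling layer, to show how weight sharing and pooling cut down the number of weights.

```python
import torch
import torch.nn as nn

# Fully connected: a distinct weight for every input-output connection.
fc = nn.Linear(84 * 84, 256)                      # 84*84*256 ≈ 1.8M weights

# Weight sharing (convolution): one small filter reused across the whole image.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5)  # 16*1*5*5 = 400 weights

# Pooling: downsample the feature maps (no weights at all).
pool = nn.MaxPool2d(kernel_size=2)

x = torch.randn(1, 1, 84, 84)                     # a dummy 84x84 "image"
print(sum(p.numel() for p in fc.parameters()))    # ~1.8M
print(sum(p.numel() for p in conv.parameters()))  # 416 (400 weights + 16 biases)
print(pool(conv(x)).shape)                        # torch.Size([1, 16, 40, 40])
```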
2.4 - Deep NNs
Residual network: adds skip connections between layers; usually illustrated by comparing VGG-19, a plain deep network, and the residual version of that network.
Each box (layer) in the diagram has a number associated with it (the number of filters/units in that layer).
A DNN such as this has on the order of 1M weights to learn.
2.5 - Training Neural Networks
Learn the parameters/weights by optimising an objective function:
Minimise the error between the model's predictions f(x) and the true labels y
Performed through backpropagation and gradient descent
We feed in inputs (as represented by the matrix x)
These inputs get multiplied by the weights {w1,w2,w3,w4,w5,w6} to determine the values at the nodes z1,z2,z3
The values at the nodes z1,z2,z3 are then multiplied by the weights {w7,w8,w9} to determine the final value y^1
Through all of this, we are trying to approximate a function f(x)
The amount of error informs the DNN and drives an update of the weights.
The objective function / cost function / empirical risk is a measure of the difference between what our network predicted vs the actual observed value.
J(θ) = (1/n) ∑_{i=1}^{n} L(f(x^(i); θ), y^(i)), where f(x^(i); θ) is the predicted value and y^(i) is the actual value
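A minimal numpy sketch of the empirical risk above, assuming a squared-error loss L (other losses, e.g. cross-entropy, slot into the same template).

```python
import numpy as np

def empirical_risk(f, theta, xs, ys):
    """J(theta) = (1/n) * sum_i L(f(x_i; theta), y_i), with squared-error loss."""
    predictions = np.array([f(x, theta) for x in xs])
    losses = (predictions - ys) ** 2          # L(predicted value, actual value)
    return losses.mean()                      # average over the n examples

# Toy example: a linear model f(x; theta) = theta . x
f = lambda x, theta: np.dot(theta, x)
xs = np.array([[1.0, 2.0], [3.0, 4.0]])
ys = np.array([5.0, 6.0])
print(empirical_risk(f, np.array([1.0, 1.0]), xs, ys))  # ((3-5)^2 + (7-6)^2)/2 = 2.5
```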
2.5.1 - Optimisation / Function Approximation
Suppose we collect the weights and bias into a single parameter vector θ = [w, b].
We then compute the gradient using partial derivatives: ∂J(θ)/∂θ
Then we update the weights: θ ← θ − α·∂J(θ)/∂θ
Non-linear function approximation is achieved by using a non-linear activation function
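A minimal sketch of the update rule above. The gradient is estimated by finite differences purely for illustration; backpropagation computes the same quantity analytically. The toy objective J is made up.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    """Finite-difference estimate of dJ/dtheta (backprop computes this analytically)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

def gradient_descent(J, theta, alpha=0.1, iterations=100):
    for _ in range(iterations):
        theta = theta - alpha * numerical_gradient(J, theta)  # theta <- theta - alpha * dJ/dtheta
    return theta

# Toy objective with its minimum at theta = [3, -1]
J = lambda t: (t[0] - 3) ** 2 + (t[1] + 1) ** 2
print(gradient_descent(J, np.array([0.0, 0.0])))  # converges towards [3, -1]
```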
2.6 - Types of Machine Learning
Supervised Learning
Input: Data with labels {x,y}
Output: Learn mapping function x→y
Target used to determine error which is fed back into the supervised learning algorithm
Unsupervised learning
Input: Data without labels {x}
Output: Learn the underlying structure (e.g. classes)
Reinforcement Learning
Input: State-Action Pairs {s,a}
Output: Action that maximises expected future rewards
Actions are evaluated via the rewards encountered, which are used as feedback to the RL algorithm
2.0 - Reinforcement Learning
Gather data as we are learning
Which data to use affects the learning performance
The agent has control over which data it gathers (it does so in the form of episodes)
Episode - Sequence of states from starting state to terminal state.
Agent performs an action A_t and observes feedback (in this case, in the form of the next state and reward received).
Agent is able to modify the action that it chooses to maximise the reward.
Since the reward is unknown, the agent must explore the environment.
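A minimal sketch of this interaction loop for one episode; `env` and `agent` are hypothetical objects, and the reset/step/select_action/update interface is an assumption rather than any particular library's API.

```python
def run_episode(env, agent):
    """One episode: a sequence of states from the starting state to a terminal state.
    `env` and `agent` are hypothetical objects; their interface is an assumption."""
    state = env.reset()                                   # starting state
    episode = []
    done = False
    while not done:
        action = agent.select_action(state)               # agent chooses A_t (may explore)
        next_state, reward, done = env.step(action)       # feedback: next state and reward
        episode.append((state, action, reward))
        agent.update(state, action, reward, next_state)   # learn from the feedback
        state = next_state
    return episode
```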
2.1 - Deep RL
A combination of Deep Learning + Reinforcement Learning
Deep Learning: Feature representation learning
Reinforcement Learning: Exploration / interaction in an environment.
Earlier layers in the network learn more primitive features (such as lines, edges)
Deeper layers in the network learn more complex features (e.g. hexagonal structure, beak in the bottom left corner)
Deep RL - deep networks serve as the function approximators and feature representation learners
2.2 - Reinforcement Learning Recap
The value of a state (or action) is the expected sum of discounted future rewards
The return at time step t is the sum of expected rewards in future time steps, discounted by γ
The ultimate goal of our agent is to maximise V (i.e. the expected value of successive rewards).
The equations V(s) = R(s) + γ·max_a Q(s, a) and Q(s, a) = ∑_{s'} P(s' | s, a)·V(s') are Bellman equations
The relationship between V(s) and Q(s, a) is that the Q-value is defined for a (state, action) pair, whereas the value is defined for a single state
The Q-function captures the expected total future reward an agent in state s can receive by executing a certain action a
We want to maximise the expected reward for a given state, and action
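A tiny worked example of the two Bellman relationships above, on a made-up MDP where the successor values are assumed known:

```python
# Tiny, made-up MDP to illustrate the two Bellman relationships above.
gamma = 0.9
R = {"s0": 1.0}                                  # reward depends only on the state here
P = {("s0", "left"):  {"s1": 1.0},               # P(s' | s, a)
     ("s0", "right"): {"s1": 0.5, "s2": 0.5}}
V = {"s1": 10.0, "s2": 2.0}                      # assume successor values are known

# Q(s,a) = sum_{s'} P(s'|s,a) * V(s')
Q = {(s, a): sum(p * V[s2] for s2, p in succ.items()) for (s, a), succ in P.items()}
print(Q)                                         # {('s0','left'): 10.0, ('s0','right'): 6.0}

# V(s) = R(s) + gamma * max_a Q(s,a)
V_s0 = R["s0"] + gamma * max(Q[("s0", a)] for a in ["left", "right"])
print(V_s0)                                      # 1.0 + 0.9 * 10.0 = 10.0
```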
2.2.1 - Policy
Once we know the Q-values, extracting the optimal policy is easy - pick the action with the highest value
Note - state-values already incorporate the sum of discounted future rewards.
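A minimal sketch of extracting the greedy policy from a Q-table; the Q-values shown are illustrative.

```python
def greedy_policy(Q, state, actions):
    """Extract the optimal action for `state` from a Q-table: pick the action with the
    highest Q-value (Q-values already incorporate the discounted future rewards)."""
    return max(actions, key=lambda a: Q[(state, a)])

Q = {("s0", "left"): 10.0, ("s0", "right"): 6.0}   # illustrative values
print(greedy_policy(Q, "s0", ["left", "right"]))   # 'left'
```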
2.3 - Deep RL Algorithms
Deep RL algorithms can be broadly classified into two types of algorithms
Value Learning
Find Q(s,a)
a = argmax_a Q(s, a)
Estimate the values of each (s, a) pair
After the estimates have converged, find action
Action found by finding a with largest value for fixed state.
Policy Learning
Find π(s)
Sample a∼π(s)
Find policy - sample actions from probabilistic policy.
2.4 - Value Learning RL
Actions:
move_paddle_left
move_paddle_right
do_not_move_paddle
Rewards:
If ball hits brick, reward = 1
Otherwise, reward = 0
End condition
If ball falls off the screen, the game ends.
Can we learn to control an agent directly from sensor inputs?
In breakout, sensory input would be a game screen frame
Finding a state representation - do we know the direction of the ball? the velocity of the ball?
2.4.1 - Q-Value Approximation
Use a function to approximate the Q-function: Q(s, a; θ) ≈ Q*(s, a)
Which of these state-actions is better? A or B?
Which has a higher Q(s,a)?
Hand-designing features w(s,a)
Performance depends on the quality of features
Not generalized
Doesn't scale well with game complexity
Handcrafting features is very difficult!
We want to learn features from pixels!
We can stack consecutive frames for each state to determine velocity and direction
However:
Assuming that there are 84×84 pixels per frame, each pixel can take on 256 values, and there are 4 stacked frames per state
There are approximately 256^(84×84×4) states
For comparison, the number of atoms in the universe is on the order of 10^78 to 10^82
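A quick back-of-the-envelope check of that state count, done in log-space since the number itself is astronomically large:

```python
import math

pixels_per_state = 84 * 84 * 4          # 4 stacked 84x84 frames
values_per_pixel = 256

# Number of distinct states = 256^(84*84*4); work in log10 to avoid a huge integer.
log10_states = pixels_per_state * math.log10(values_per_pixel)
print(f"~10^{log10_states:.0f} possible states")   # ~10^67970
print("atoms in the universe: ~10^78 to 10^82")    # a tabular Q-function is hopeless here
```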
2.4.2 - DNNs as a Q-Value Approximator
To solve this, we can use a DNN as a Q(s,a) approximator
Input: State s, action a
Output: The expected return: Q(s, a)
Can we make this even more efficient?
Yes, we can set the state as the only input into the NN
NN outputs a Q-value for every possible action
Select action corresponding to the highest Q-value
A single network to predict Q(s, a) for all possible states and actions
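A PyTorch sketch of such a Q-network: the state (4 stacked 84×84 frames) is the only input and the output is one Q-value per action. The layer sizes follow the commonly cited Atari DQN architecture, but treat them as illustrative assumptions rather than the definitive model.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State in, one Q-value per action out (layer sizes follow the classic
    Atari DQN; treat them as an illustrative assumption)."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),          # one output per possible action
        )

    def forward(self, state):
        return self.net(state)                  # shape: (batch, n_actions)

q_net = QNetwork(n_actions=3)                   # Breakout: left, right, no-op
state = torch.zeros(1, 4, 84, 84)               # 4 stacked 84x84 frames
q_values = q_net(state)
action = q_values.argmax(dim=1)                 # select the action with the highest Q-value
```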
2.4.3 - Training DQNs
Training Deep-Q-Networks can be done using the Bellman Equations
Q(s, a) = r + γ·max_{a'} Q(s', a')
We also need a loss function to determine how well our network is performing
L = E[(r + γ·max_{a'} Q(s', a') − Q(s, a))²]
r + γ·max_{a'} Q(s', a') is the target
Q(s, a) is the predicted value
Use the neural network to learn the Q-function and use it to infer the optimal policy.
Take an experience ⟨s, a, r, s′⟩ and update the network as follows:
Do a feedforward pass for the current state s to get predicted Q-values for all actions
Do a feedforward pass for the next state s′ and calculate the maximum over all network outputs, max_{a'} Q(s′, a′)
Set the Q-value target for action a to r + γ·max_{a'} Q(s′, a′) (keep the predicted values for the other actions)
Update the weights of the neural network using backpropagation
🧠 This is very similar to using TD target to make Q-Learning estimates closer to the actual value.
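A minimal sketch of one such update for a single experience ⟨s, a, r, s′⟩, reusing the `QNetwork` sketch from the previous section. No replay buffer or target network yet; those tricks are covered next. The learning rate and discount factor are illustrative.

```python
import torch
import torch.nn.functional as F

gamma = 0.99
optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-4)   # q_net: QNetwork sketch above

def dqn_update(s, a, r, s_next, done):
    """One update from an experience <s, a, r, s'>.
    s, s_next: tensors of shape (1, 4, 84, 84); a: int; r: float; done: 0.0 or 1.0."""
    q_pred = q_net(s)[0, a]                           # feedforward pass for s: Q(s, a)
    with torch.no_grad():                             # feedforward pass for s'
        q_next_max = q_net(s_next).max(dim=1).values[0]
        target = r + gamma * q_next_max * (1.0 - done)   # r + gamma * max_a' Q(s', a')
    loss = F.mse_loss(q_pred, target)                 # L = (target - Q(s, a))^2
    optimiser.zero_grad()
    loss.backward()                                   # backpropagation updates the weights
    optimiser.step()
    return loss.item()
```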
When training DQN naively, it doesn't always work that well
There are a few tricks that we can use to better train the weights
Experience Replay (store transitions ⟨s, a, r, s′⟩ in a buffer and sample random mini-batches from it when updating the network)
Fixed Target Network
Reward Clipping (avoid extremes)
Skipping Frames
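A minimal sketch of the first two tricks: an experience replay buffer, plus a comment on where a fixed target network would plug in. The buffer capacity and sync interval are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions <s, a, r, s', done> and train on random
    mini-batches instead of only the most recent (highly correlated) experience."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Fixed target network: a periodically-synced copy of the Q-network used to compute
# the target r + gamma * max_a' Q_target(s', a'), so the target does not shift with
# every single weight update (sync e.g. every few thousand steps):
#   target_net.load_state_dict(q_net.state_dict())
# Reward clipping: clip rewards to [-1, 1] to avoid extreme targets.
```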
2.4.4 - DQN for Atari
A single algorithm that learns from pixels to extract a policy
Comparison of the DQN agent with the best RL methods in literature
The performance of the DQN is normalised with respect to a professional human games tester (that is, 100% level) and random play (that is 0% level)
DQN Doesn't Always Work Well
In Atari Pinball, DQN does very well - 2539% of human performance
In Atari Time Pilot, DQN is approximately at the level of human performance
In Atari Montezuma's Revenge, DQN does much worse than human performance.
Rewards are very sparse - the signals given to the algorithm don't occur very often.
Also rewards are very far away in the future.
2.5 - Downsides of Q-Learning
Does not work well if environment has sparse, delayed rewards
Does not work well in continuous action spaces - the network outputs a Q-value for every action, which cannot be done for a continuous action space
Policy is deterministically computed from the Q-function by maximising the Q-value → cannot learn stochastic policies.
2.6 - Policy Gradient
DQN (off-policy): Approximates Q and infers optimal policy
PG (on-policy): Directly optimises policy, π(s)
Note here that we have a single-layer NN
Policy Gradient directly optimises the policy π(s)
Enables modelling of continuous action spaces.
State as an input into the NN
NN learns weights and outputs a probability for each action.
Policy is now a probability distribution of what action we should perform conditioned on the current state.
Sample an action from the output probability distribution (rather than always taking the highest-probability action).
Run the policy for a while. See what actions led to high rewards and increase their probability.
This will handle the case where the rewards are sparse or far away.
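A PyTorch sketch of such a policy network: state in, a probability for each action out, and the action sampled from that distribution. The state dimension, action count, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a | s): state in, a probability for each action out (illustrative sizes)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)   # probabilities sum to 1

policy = PolicyNetwork(state_dim=4, n_actions=2)
state = torch.zeros(1, 4)
probs = policy(state)                                   # e.g. tensor([[0.52, 0.48]])
dist = torch.distributions.Categorical(probs)
action = dist.sample()                                  # sample, don't just take the argmax
```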
2.6.1 - Training Policy Gradient
Initialise the agent
Run a policy until termination
Record all ⟨s,a,r⟩ (states, actions and rewards)
Decrease probability of actions that resulted in low reward
Increase the probability of actions that resulted in a high reward.
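A minimal REINFORCE-style sketch of that loop for one recorded episode, reusing the `policy` sketch from the previous section. Scaling each action's log-probability by the discounted return that followed it is what pushes low-reward actions down and high-reward actions up; the learning rate and discount are illustrative.

```python
import torch

gamma = 0.99
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)  # policy: sketch above

def policy_gradient_update(states, actions, rewards):
    """One update from a recorded episode: states is a list of state tensors,
    actions a list of ints, rewards a list of floats (the recorded <s, a, r>)."""
    # Discounted return G_t for every time step, computed backwards from the end.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Increase log pi(a_t | s_t) in proportion to the return that followed it:
    # actions followed by low reward get pushed down, high reward pushed up.
    probs = policy(torch.stack(states))
    log_probs = torch.distributions.Categorical(probs).log_prob(torch.tensor(actions))
    loss = -(log_probs * returns).mean()

    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```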
2.7 - Summary
Explain the explore-exploit dilemma and solutions to multi-armed bandit problems
Explain the relationship between decision-theoretic planning (MDPs) and reinforcement learning
Reinforcement Learning is an MDP with more unknowns (don't know transition function or reward model)
Explain the difference between model-based and model-free RL
Model-Based - Trying to learn the model
Model-Free - Estimating values for each state directly, Q-Learning and SARSA
Explain the difference between on-policy and off-policy RL
I.e. Q-Learning vs SARSA
Off-Policy (Q-Learning) - learns about the greedy/optimal policy while following a different, exploratory behaviour policy
On-Policy (SARSA) - learns about the policy it is actually following, using the action that policy chooses next
Implement and trace basic table-based RL algorithms - Q-Learning and SARSA
Given a table of every (s, a) pair, estimate the value of each pair (a minimal update sketch follows at the end of this summary)
Understand the limitations of state-based reinforcement learning and the use of feature-based RL algorithms including Deep RL
The number of (s, a) pairs grows exponentially with the number of state variables - when should we use feature-based RL instead?
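A minimal sketch of the table-based updates referred to above, with a dictionary as the Q-table; α and γ are illustrative.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)                 # Q-table: one entry per (state, action) pair

def q_learning_update(s, a, r, s_next, actions):
    """Off-policy: bootstrap from the best next action, whatever we actually do next."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action the current policy actually chose next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```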