COMP4702 Lecture 5

Course: Machine Learning
Semester: S1 2023

  • Focused on the learning problem in machine learning which is almost always driven by optimisation

Principles of Parametric Modelling

  • General expression for a regression model: $y = f_\theta(\mathbf{x}) + \varepsilon$.
    • Attempting to model the relationship between inputs and outputs with a function that has parameters $\theta$.
    • Assume additive noise $\varepsilon$.
  • In the case of Linear Regression, we know:
    • The model output is a linear function of the parameters
    • The model can be trained in one step, with a closed-form solution using the least-squares error function
    • Assuming Gaussian noise, minimising SSE is equivalent to using the maximum likelihood principle
  • However, we want to train non-linear models, using various loss functions (and perhaps different noise assumptions)
  • Outside ML, non-linear models are often used where the model parameters have a particular (e.g., physical) meaning; in ML this is usually not the case.
  • Conceptually, we can replace the linear function in linear regression, and the logit in logistic regression, with some other non-linear function $f_\theta(\mathbf{x})$ and things still fit in the maximum likelihood framework
  • To learn the model, we then need to solve an optimisation problem (Eq 5.4) as a proxy for what we really want to optimise (Eq 5.5)
  • Not just a simple matter of solving a numerical optimisation problem
    • Optimisation on training set beyond a certain point isn’t going to help - training set is finite sample
    • The evaluation function (error) might be different from the loss function used for training
      • Some ability to control model complexity (e.g., through explicit regularisation, such as $L_2$)
    • Early stopping is one way to achieve good generalisation, by not allowing the optimisation to progress all the way to a solution

$$\hat\theta=\arg\min_\theta\underbrace{\frac{1}{n}\sum_{i=1}^{n}\overbrace{\mathcal{L}(y_i,f_\theta(\mathbf{x}_i))}^{\text{loss function}}}_{\text{cost function } J(\theta)}\tag{5.4}$$

  • Equation 5.4 is the optimisation problem we solve when training a machine learning model
    • The optimisation problem is quite tricky to solve, as the optimal solution isn’t exactly what we want
    • We really want to minimise the error on new, unseen data
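A minimal NumPy sketch of the cost function in Eq. 5.4 for a linear model with squared-error loss; the data and function names here are illustrative, not from the lecture:

```python
import numpy as np

def cost(theta, X, y):
    """Cost function J(theta) from Eq. 5.4: the average squared-error loss
    over the n training points, for a linear model f_theta(x) = theta^T x."""
    predictions = X @ theta            # f_theta(x_i) for every row x_i of X
    losses = (predictions - y) ** 2    # per-example loss L(y_i, f_theta(x_i))
    return losses.mean()               # (1/n) * sum of losses

# Tiny illustration with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)   # y = f_theta(x) + noise
print(cost(theta_true, X, y))   # small, but not zero, because of the noise
```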

Loss Functions and Likelihood-Based Models

  • Different loss functions used in machine learning

  • Different properties, useful in different situations

  • Given specific information, some combinations of model + loss function make more sense than others

  • For regression, we have seen the squared error ($L_2$) loss, which corresponds to maximum likelihood under a Gaussian noise assumption, as shown in Equation 5.6

    $$\mathcal{L}(y,\hat{y})=(\hat{y}-y)^2\tag{5.6}$$

    • Absolute error ($L_1$) is more robust to outliers (as it grows more slowly for large errors) and can help with model complexity

    $$\mathcal{L}(y,\hat{y})=|\hat{y}-y|\tag{5.7}$$

    • However, may require different optimisation techniques
    • Additionally, consider the Huber loss and the $\epsilon$-insensitive loss (a small sketch of these losses follows Figure 1 below)

    Figure 1 - Loss functions for regression, plotted as a function of the error
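A minimal sketch of these regression losses, written as functions of the error $\hat{y}-y$; the `delta` and `eps` parameters are illustrative choices, not values from the lecture:

```python
import numpy as np

def squared_loss(e):                # Eq. 5.6, as a function of the error e = y_hat - y
    return e ** 2

def absolute_loss(e):               # Eq. 5.7; grows linearly, so outliers dominate less
    return np.abs(e)

def huber_loss(e, delta=1.0):       # quadratic near zero, linear for |e| > delta
    quad = 0.5 * e ** 2
    lin = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quad, lin)

def eps_insensitive_loss(e, eps=0.5):   # zero inside the epsilon-tube, linear outside
    return np.maximum(0.0, np.abs(e) - eps)

errors = np.linspace(-3, 3, 7)
print(huber_loss(errors))
```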

  • For classification, misclassification loss is intuitive but not often directly used

    • One reason is that this results in a cost function that is piecewise constant - not friendly to numerical optimisers that use gradient information.
    • For a binary classifier that predicts conditional probabilities $p(y=1\mid\mathbf{x})$ in terms of a function $g(\mathbf{x})$, we then optimise the cross-entropy loss (through the maximum likelihood framework)
  • Can also work with a measure of how close the (training) data points are to the decision boundary (aka margin)

  • If we can bring that information into the loss function, we can train a model to maximise the margin (e.g. SVMs)

  • If we think about logistic regression, then we end up making a (hard) class prediction by thresholding the logit $\theta^T\mathbf{x}$ to convert it into class predictions -1 and +1.

$$\hat{y}(\mathbf{x})=\operatorname{sign}\{f(\mathbf{x})\}=\operatorname{sign}\{\theta^T\mathbf{x}\}$$

  • The margin of a classifier for a data point $(\mathbf{x},y)$ is $y \cdot f(\mathbf{x})$ (a small sketch follows Figure 2 below)
  • There are several loss functions that include the concept of margin.
    • Some allow for a probabilistic interpretation of the prediction, some don’t

Figure 2 - Loss Functions for Classification
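A minimal sketch of the margin and a few margin-based losses for labels in $\{-1,+1\}$: the hinge loss used by SVMs, the logistic (cross-entropy) loss, and the misclassification loss. The names and data are illustrative:

```python
import numpy as np

def margins(theta, X, y):
    """Margin y * f(x) for a linear classifier f(x) = theta^T x, labels in {-1, +1}.
    A positive margin means the point is on the correct side of the boundary."""
    return y * (X @ theta)

def hinge_loss(m):        # used by SVMs; zero once the margin exceeds 1
    return np.maximum(0.0, 1.0 - m)

def logistic_loss(m):     # cross-entropy written in terms of the margin
    return np.log(1.0 + np.exp(-m))

def misclassification_loss(m):   # piecewise constant, so no useful gradient
    return (m < 0).astype(float)

m = np.linspace(-2, 2, 9)
print(hinge_loss(m), logistic_loss(m))
```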

  • For multi-class classification, cross-entropy is straightforward, but loss functions that use margin are not.
  • In practice, two ways to tackle these problems:
    • One-vs-rest: Train $M$ binary classifiers, each one predicting one of the classes vs all the others
      • To predict a point, evaluate each classifier and use the prediction with the greatest margin
    • One-vs-one: Train a classifier to discriminate between every pair of classes (with $\frac{1}{2}M(M-1)$ pairs)
      • To predict a point, evaluate each classifier and take a majority vote. Use the margin to break ties.
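A hedged sketch of both prediction schemes, assuming the binary classifiers are linear and already trained (represented here simply by weight vectors); margin-based tie-breaking is omitted for brevity:

```python
import numpy as np

def predict_one_vs_rest(thetas, x):
    """thetas: list of M weight vectors, one binary classifier per class
    (class m vs. the rest). Predict the class whose classifier gives the
    largest margin f_m(x) = theta_m^T x."""
    scores = np.array([theta @ x for theta in thetas])
    return int(np.argmax(scores))

def predict_one_vs_one(pair_thetas, x, M):
    """pair_thetas: dict mapping a class pair (a, b), a < b, to the weight vector
    of a classifier trained on those two classes (positive score -> class a).
    Predict by majority vote over all M(M-1)/2 pairwise classifiers."""
    votes = np.zeros(M, dtype=int)
    for (a, b), theta in pair_thetas.items():
        votes[a if theta @ x > 0 else b] += 1
    return int(np.argmax(votes))

rng = np.random.default_rng(0)
thetas = [rng.normal(size=4) for _ in range(3)]      # 3 hypothetical OvR classifiers
print(predict_one_vs_rest(thetas, rng.normal(size=4)))
```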

Note: We leave the next two sections of the book as optional reading

Regularisation

  • Regularisation: Some strategy for controlling model complexity.
    • Penalise models that are getting too complex.
  • Regularisation can be either implicit or explicit

Explicit Regularisation

  • Adding a penalty term to the cost function.
  • $L^2$ regularisation adds a penalty term to the cost function that is the squared $L^2$-norm of the model parameters, with a hyperparameter $\lambda\ge0$ to weight the influence of this term.
    • In Linear Regression, this is called Ridge Regression
    • In Neural Networks, this is called weight decay
  • We add the term $\lambda \|\theta\|_2^2$ to the cost function; for linear regression this gives the ridge regression problem

$$\hat\theta=\arg\min_\theta\frac{1}{n} \|\mathbf{X}\theta-\mathbf{y}\|_2^2 + \lambda \|\theta\|_2^2\tag{5.22}$$

  • $L^1$ regularisation instead adds the $L^1$-norm of the parameters (the norm is sometimes called the Manhattan distance; in linear regression this regulariser is known as LASSO).
    • This is more complex to optimise, but has the advantage that it tends to favour sparse solutions, where some of the model parameters become (exactly) zero.
    • This is like feature selection for free!

$$\hat\theta=\arg\min_\theta\frac{1}{n} \|\mathbf{X}\theta-\mathbf{y}\|_2^2 + \lambda \|\theta\|_1\tag{5.25}$$
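Eq. 5.22 has a closed-form minimiser, $\hat\theta=(\mathbf{X}^T\mathbf{X}+n\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ (the $n$ appears because the squared-error term carries a $\frac{1}{n}$ factor), whereas Eq. 5.25 does not and needs an iterative solver. A minimal sketch of the ridge solution; the data is illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimiser of Eq. 5.22:
    theta = (X^T X + n*lambda*I)^{-1} X^T y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.1))   # coefficients shrink towards zero as lam grows
```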

Implicit Regularisation

  • Implicit regularisation doesn’t add an explicit term to the cost function, but tries to control model complexity some other way.
  • We will see a few examples of implicit regularisation in this course:
    • Early stopping: If training means running an iterative optimisation, monitor $E_\text{hold-out}$ and stop training when it achieves a minimum value
    • Dropout (for Neural Networks)
    • Data augmentation
    • Splitting criterion in decision trees? (Model hyperparameters can also have a regularisation effect.)

Parameter Optimisation

  • Certain types of optimisation are heavily used in machine learning

  • The most important of these are numerical optimisation techniques that use gradient information to try to find a good solution

  • There are other techniques not mentioned in the textbook, including derivative free methods, metaheuristics (e.g., evolutionary algorithms), linear programming, integer programming.

  • The textbook highlights two types of optimisation problems:

    1. Training a model
    2. Hyperparameter optimisation (tuning)

Examples of Objective Functions

Figure 3 - Example of Objective Functions plotted in 3d

  • The goal of training our model is to find the $\theta$ values that minimise the objective function.

  • The function shown on the left is a convex function, and it has a unique global minimum (simple function to optimise)

  • The function shown on the right has three local minima, one of which is the global minimum (more complex function to optimise)

  • We need an optimisation algorithm that can (iteratively) search the surfaces defined by these functions, seeking out the global minimum

    • Note that these optimisation algorithms cannot see the entire surface as we are able to see here - realistically, they only evaluate one point at a time.
  • The contour plots of the functions shown above are given below.

    Figure 4 - Contour plots of functions shown in Figure 3

  • Convex functions are easier to optimise from a machine learning perspective

    • Some convex functions have closed-form solutions (e.g., linear regression) but not all do (e.g. logistic regression)
    • In the case where we don’t have a closed-form solution, we require an iterative algorithm to solve it.

Gradient Descent

  • One of the oldest optimisation techniques.
  • Used in cases where we have functions that we can calculate the gradient of.
  • Gradient descent iterates by taking a step in the direction of the negative gradient.

(A note on the optimisation of parameters from first principles)

  • For linear regression with the squared error loss, training the model amounts to solving the following problem, which has a closed-form solution (given that $\mathbf{X}^T\mathbf{X}$ is invertible)

$$\hat\theta=\arg\min_\theta\frac{1}{n} \|\mathbf{X}\theta-\mathbf{y}\|_2^2 \tag{3.12}$$

  • We can efficiently implement the optimisation algorithm for some of the loss functions in the case of linear regression.
  • However, this is not always the case, and for a lot of cases, iterative solutions must be found.
  • The general parameter learning problem is given as

$$\hat\theta=\arg\min_\theta J(\theta) \tag{5.28}$$

  • We can use gradient descent to (iteratively) learn parameter vectors $\theta$ when the objective function $J(\theta)$ is simple enough such that its gradient is possible to compute.
  • Assume that the gradient of the cost function $\nabla_\theta J(\theta)$ exists for all $\theta$.
  • For example, the gradient of the cost function for logistic regression (Eq 3.34) is given as

$$\nabla_\theta J(\theta)=-\frac{1}{n}\sum_{i=1}^{n} \frac{1}{1 + e^{y_i \theta^T \mathbf{x}_i}}\, y_i \mathbf{x}_i\tag{5.29}$$

  • Note that $\nabla_\theta J(\theta)$ is a vector of the same dimension as $\theta$ that describes the direction in which $J(\theta)$ increases.
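A minimal NumPy sketch of the gradient in Eq. 5.29, assuming labels $y_i \in \{-1,+1\}$; the data and names are illustrative:

```python
import numpy as np

def logistic_grad(theta, X, y):
    """Gradient of the logistic-regression cost (Eq. 5.29), labels y_i in {-1, +1}:
    -(1/n) * sum_i  y_i * x_i / (1 + exp(y_i * theta^T x_i))."""
    n = X.shape[0]
    margins = y * (X @ theta)                    # y_i * theta^T x_i per data point
    weights = 1.0 / (1.0 + np.exp(margins))      # scalar factor per data point
    return -(X * (weights * y)[:, None]).sum(axis=0) / n

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))
print(logistic_grad(np.zeros(3), X, y))
```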

Gradient Descent Algorithm

  • The gradient descent algorithm is given as:
  • Input: Objective function $J(\theta)$, initial $\theta^{(0)}$ and learning rate $\gamma$
  • Result: $\hat{\theta}$
  1. Set $t\leftarrow 0$
  2. While $\|\theta^{(t)} - \theta^{(t-1)}\|$ not small enough, do
  3.     Update $\theta^{(t+1)}\leftarrow\theta^{(t)}-\gamma\nabla_\theta J(\theta^{(t)})$
  4.     Update $t\leftarrow t+1$
  5. end
  6. return $\hat\theta\leftarrow\theta^{(t-1)}$
  • Note: To use the gradient descent technique we need an objective function and an initial guess for our parameters (a runnable sketch of this algorithm is given at the end of this subsection)

  • The key to this algorithm is step (3).

    • We update the new parameter vector using the old parameter vector, moving in the direction of the negative gradient $-\nabla_\theta J(\theta^{(t)})$ by a distance proportional to $\gamma$.

    Figure 5 - Effect of learning rate on finding global minima

  • In the figure shown above, we can see the effect of different learning rates.

    • In (a) we see that the learning rate is too low, and the algorithm approaches the minimum too slowly
    • In (b) we see that the learning rate is too high, and the algorithm does not approach the minimum efficiently
    • In (c) we see a good learning rate for the problem
  • The size of the learning rate $\gamma$ is very much dependent on the shape of the cost function.

    • In this example $\gamma=0.3$ is a good value, but this may be too small (or too large) for another cost function.
  • Observe in (b) that as the gradient increases, the step size increases.

    • The distance moved is the product of the learning rate $\gamma$ and the local gradient
  • Likewise in (c), as you approach the minimum, the gradient decreases and the step size decreases

  • Within line (3) of the algorithm, an internal optimisation problem is sometimes solved at each iteration to find the optimal $\gamma$.

    • This is a so-called line-search problem
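A runnable sketch of the gradient descent algorithm above, with a fixed learning rate and no line search; the quadratic test objective is purely illustrative:

```python
import numpy as np

def gradient_descent(grad, theta0, gamma=0.3, tol=1e-6, max_iter=1000):
    """Minimal version of the algorithm above: step in the direction of the
    negative gradient until the parameters stop changing (or max_iter is hit)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = theta - gamma * grad(theta)        # step (3): theta - gamma * grad J(theta)
        if np.linalg.norm(theta_new - theta) < tol:    # stopping criterion from step (2)
            return theta_new
        theta = theta_new
    return theta

# Illustrative convex objective J(theta) = (theta_1 - 1)^2 + 2*(theta_2 + 3)^2
grad_J = lambda t: np.array([2 * (t[0] - 1), 4 * (t[1] + 3)])
print(gradient_descent(grad_J, theta0=[0.0, 0.0], gamma=0.1))   # converges to ~[1, -3]
```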

Gradient Descent Example

Example 5.5

Figure 6 - Convex function

  • In the example above, we see two attempts (green and blue lines) to reach the global minimum.

  • Both attempts reach the global minimum

    Figure 7 - Non-Convex function

  • However, in Figure 7, we see attempts to find the global minimum on two non-convex functions

  • The optimiser doesn’t always find the global minimum (as in the case of the blue line)

  • Furthermore, in the image to the right, the learning rate is too large, and the algorithm doesn’t converge (where we observe strange oscillatory behaviour).

Other Methods

  • Second-order gradient methods (e.g. Newton’s method) use second derivative (curvature) information to try to improve the optimisation performance
    • Add considerable complexity and don’t always produce good results.
  • Example 5.7 shows us how early stopping can work, and reminds us that we don’t actually want the best solution (global optimum) of our training problem.

Early Stopping with Logistic Regression

An example of regularisation by early stopping

  • Consider the music classification problem (Example 2.1)
  • Apply multi-class logistic regression with a degree-20 polynomial input transformation, which gives a model of the form

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1x_2 + \theta_5 x_2^2 + \cdots + \theta_{229} x_1 x_2^{19} + \theta_{230} x_2^{20}$$

  • This model has 231 parameters, and with a limited amount of training data, this will likely lead to over-fitting if no regularisation is used.

    Figure 8 - Hold-out error

  • Figure 8 shows the hold out error for the multi-class classification problem

  • Whilst it doesn’t show the loss of the model per iteration, we would expect the loss function to decrease over time.

  • The hold-out error goes down but then back up

    • The minimum hold-out error occurs at iteration 12, which suggests that over-fitting occurs after that point and we should probably have stopped training there.
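A hedged sketch of how early stopping could be implemented: run the usual gradient-descent loop, track the hold-out error after each update, and keep the parameters from the iteration where it was smallest. The `grad` and `holdout_error` callables are placeholders, not functions from the lecture:

```python
import numpy as np

def train_with_early_stopping(grad, holdout_error, theta0, gamma=0.1, max_iter=200):
    """grad(theta): gradient of the training cost (placeholder supplied by the caller).
    holdout_error(theta): error on the hold-out set (also a placeholder).
    Returns the parameters with the lowest hold-out error seen during training."""
    theta = np.asarray(theta0, dtype=float)
    best_theta, best_err = theta.copy(), holdout_error(theta)
    for t in range(max_iter):
        theta = theta - gamma * grad(theta)        # ordinary gradient step
        err = holdout_error(theta)
        if err < best_err:                         # new minimum of E_hold-out
            best_theta, best_err = theta.copy(), err
    return best_theta                              # e.g. the iteration-12 model in Figure 8
```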

Optimisation with Large Datasets

  • When training ML models, the objective function usually contains a term that is a sum over all data points in the training set.
  • Important idea: being able to approximate this sum using a subsample of the data points
  • One approximation for this is stochastic gradient descent
    • Can get a good enough approximation of the gradient from a subsample (mini-batch) of the dataset
    • Greater computational efficiency
    • Controversial argument that the noise is able to help the algorithm escape bad solutions during training.
  • If each iteration of gradient descent updates on a mini-batch, then once we have worked through the entire dataset we have completed one epoch.
  • The update equations for stochastic gradient descent are shown below for the case where each mini-batch is half of the dataset (a more general sketch follows them).

$$\theta^{(t+1)}=\theta^{(t)} - \gamma \frac{1}{n/2} \sum_{i=1}^{\frac{n}{2}} \nabla_\theta\mathcal{L}(\mathbf{x}_i, \mathbf{y}_i, \theta^{(t)})\tag{5.37a}$$

$$\theta^{(t+2)}=\theta^{(t+1)} - \gamma \frac{1}{n/2} \sum_{i=\frac{n}{2}+1}^{n} \nabla_\theta\mathcal{L}(\mathbf{x}_i, \mathbf{y}_i, \theta^{(t+1)})\tag{5.37b}$$
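A minimal sketch of mini-batch stochastic gradient descent, generalising Eqs. 5.37a-b (which correspond to a batch size of $n/2$) to an arbitrary batch size; `grad_on_batch` is a placeholder supplied by the caller:

```python
import numpy as np

def sgd(grad_on_batch, theta0, X, y, gamma=0.1, batch_size=32, epochs=10, seed=0):
    """grad_on_batch(theta, Xb, yb): average gradient of the loss on one mini-batch.
    One epoch = one pass over the whole (shuffled) training set."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                        # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            theta = theta - gamma * grad_on_batch(theta, X[batch], y[batch])
    return theta

# Example batch gradient for least squares: (2/m) X_b^T (X_b theta - y_b)
ls_grad = lambda t, Xb, yb: 2 * Xb.T @ (Xb @ t - yb) / len(yb)
```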

  • The algorithm for stochastic gradient descent is largely the same as gradient descent, but introduces an additional loop to deal with the mini-batches

    • Iterations are renamed to epochs
  • In Neural Networks and deep learning, there have been many variations of Stochastic Gradient descent algorithms for training.

    • Many of these algorithms include different techniques for adjusting the step size (learning rate) during training
  • The ADAM optimiser is the most widely-used optimiser in deep learning

    • It combines gradient and learning rate information from previous iterations

Hyperparameter Optimisation

  • As we employ more ML techniques, we end up with more hyper-parameters that we need to find the optimal value of

  • The hyper-parameters of a ML model are typically a collection of different terms, often a mix of discrete and continuous values.

  • Trying all combinations of several different values for each parameter (grid search) is common but expensive if we have a lot of hyper-parameters (Example 5.9 and Algorithm 5.4); a small sketch is given below.

  • This is where black-box or meta-heuristic optimisation algorithms have been successfully applied (e.g., AutoML)
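A minimal grid-search sketch that exhaustively enumerates the hyper-parameter combinations; `train_and_evaluate` is a placeholder that would fit a model with the given hyper-parameters and return a validation (hold-out) error:

```python
from itertools import product

def grid_search(param_grid, train_and_evaluate):
    """param_grid: dict mapping hyper-parameter names to lists of candidate values.
    train_and_evaluate(**params): placeholder that fits a model and returns its
    validation error. Tries every combination and returns the best one found."""
    names = list(param_grid)
    best_params, best_err = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        err = train_and_evaluate(**params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

# e.g. grid_search({"lam": [0.01, 0.1, 1.0], "degree": [1, 2, 3]}, train_and_evaluate)
```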