COMP4702 Chapter

Course: Machine Learning
Semester: S1 2023

A summary of Lindholm Chapter 3

Note With linear regression we are guaranteed to find the optimal solution, which can be computed in a single step using a closed-form expression. Decision trees carry no such guarantee: training might happen to find the optimal tree by luck, but this is not guaranteed.

Linear Regression

  • Regression: learning the relationship between input variables {\bf{x}}=[x_1\ \ x_2\ \ \cdots\ \ x_p]^T and a numerical output variable y.

  • Inputs can either be categorical or numerical.

  • Goal is to learn a mathematical model ff:

y=f(x)+ε(3.1)y = f(\bf{x}) + \varepsilon \tag{3.1}

  • This function maps the input x\bf{x} to the output yy
    • ε\varepsilon is the error term that describes everything about the input-output relationship that cannot be captured by the model.
    • Consider ε\varepsilon as a random variable, noise.

Linear Regression Model

  • The linear regression model assumes that the output variable yy can be described as an affine [1] combination of the pp input variables plus a noise term ε\varepsilon.

y=θ0+θ1x1+θ2x2++θpxp+ε(3.2)y=\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p + \varepsilon \tag{3.2}

  • Refer to the coefficients \theta_0, \theta_1, \cdots, \theta_p as the parameters of the model.
  • Refer to θ0\theta_0 specifically as the intercept or offset term of the model.
  • The noise term ε\varepsilon accounts for random errors in the data not captured by the model.
    • This noise is assumed to have a mean of 0 and to be independent of {\bf{x}}.
  • A more compact representation of Eq. 3.2 is achieved by introducing a parameter vector \theta=[\theta_0\ \ \theta_1\ \ \cdots\ \ \theta_p]^T and extending the input vector with a constant 1 in the first position:

\begin{align*} y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p + \varepsilon = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_p \end{bmatrix} \begin{bmatrix} 1\\x_1\\\vdots\\x_p \end{bmatrix} + \varepsilon = \theta^T {\bf{x}} + \varepsilon \end{align*} \tag{3.3}

Note In this context, {\bf{x}} is used to denote the input vector both with and without the leading constant 1.

Additionally, note that when all n data points are stacked (as in Eq 3.11), the corresponding noise terms form an n-vector {\boldsymbol{\varepsilon}} of error / noise terms; in Eq 3.3, \varepsilon is a scalar.

  • Linear regression as in Eq 3.3 is a parametric function, in which the parameters θ\bf{\theta} can take any arbitrary values.

  • The values that we assign to the parameters θ\bf{\theta} control the input-output relationship described by the model.

  • The learning of the model is then attempting to find suitable values of θ\bf{\theta} based on observed training data.

  • Goal of supervised machine learning is to make predictions \hat{y}({\bf{x}}_\star) for new, previously unseen test input {\bf{x}}_\star = [1 \ \ x_{\star 1}\ \ x_{\star 2}\ \ \cdots\ \ x_{\star p}]^T.

    • Suppose parameter values θ^\hat{\bf{\theta}} have already been learned (here ^\hat{} denotes that θ^\hat{\bf{\theta}} contains learned values of the unknown parameter vector θ\bf{\theta})

[1] Affine Linear function plus constant offset

Figure 1 - Linear regression with p=1. Black dots represent data points; the blue line represents the learned regression model. The model does not fit the data perfectly, so the remaining error corresponding to the random noise ε for each data point is shown in red. The model can be used to predict (blue circle) the output ŷ(𝐱★) for a test input 𝐱★.

  • Since we assume that noise term ε\varepsilon is random with zero-mean and independent of all variables, it makes sense to set ε=0\varepsilon=0 for prediction.
  • A prediction from a linear regression model is of the general form:

\hat{y}({\bf{x}}_\star)=\hat{\theta}_0 + \hat{\theta}_1 x_{\star 1} + \hat{\theta}_2 x_{\star 2} + \cdots + \hat{\theta}_p x_{\star p} = \hat{\theta}^{\ T} {\bf{x}}_\star \tag{3.4}

Training a Linear Regression Model

  • Want to learn \theta from training data \mathcal{T}=\{{\bf{x}}_i, y_i\}_{i=1}^{n} with n data points. We collect the inputs {\bf{x}}_i and outputs y_i in the n\times (p + 1) matrix {\bf{X}} and the n-dimensional vector {\bf{y}}:

{\bf{X}}=\begin{bmatrix} {\bf{x}}_1^T \\ {\bf{x}}_2^T \\ \vdots \\ {\bf{x}}_n^T \end{bmatrix}, \qquad {\bf{y}}=[y_1\ \ y_2\ \ \cdots \ \ y_n]^T\tag{3.5}

  • Where each xi=[1  xi1  xi2    xip]T{\bf{x}}_ i=[1\ \ x_ {i1} \ \ x_ {i2}\ \ \cdots \ \ x_ {ip}]^T

Defining a Loss Function

  • We introduce a loss function L(\hat{y}, y) as a way to measure how well our model fits the training data: it measures how closely the model's prediction \hat{y} matches the observed data y.

  • If the model fits the data well (y^y\hat{y}\approx y), then the loss function should be small, and vice versa.

  • Also define the cost function as the average loss over the training data.

  • Training a model amounts to finding the parameter values that minimise the cost:

θ^=argminθ1ni=1nL(y^(xi;θ),yi)(3.9)\hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(\hat{y}(\bf{x}_i; \theta), y_i)\tag{3.9}

  • Where:
    • L(\hat{y}({\bf{x}}_i; \theta), y_i) is the loss function
    • \frac{1}{n}\sum_{i=1}^n L(\hat{y}({\bf{x}}_i; \theta), y_i) is the cost function (the average loss).
  • Note that each term in the expression above corresponds to evaluating the loss function for the prediction y^(xi;θ)\hat{y}(\bf{x}_i; \theta)
  • argminθ\arg\min_\theta effectively means “the value of θ\bf\theta for which the cost function is the smallest”
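The loss/cost distinction above can be sketched in a few lines of numpy; the function and data values here are invented for illustration:

```python
import numpy as np

def cost(loss, y_hat, y):
    # J(theta) = (1/n) * sum_i L(y_hat_i, y_i): the cost is the average loss
    return float(np.mean([loss(p, t) for p, t in zip(y_hat, y)]))

# Squared error loss, L(y_hat, y) = (y_hat - y)^2
squared_error = lambda p, t: (p - t) ** 2

J = cost(squared_error, [1.5, 2.0, 2.0], [1.0, 2.0, 3.0])
```

Minimising J over the model parameters (the arg min in Eq 3.9) is then a separate optimisation step.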

Least Squares and the Normal Equations

  • For regression, a commonly used loss function is the squared error loss:

L(y^(x;θ),y)=(y^(x;θ)y)2(3.10)L(\hat{y}(\bf{x}; {\bf{\theta}}), y)=(\hat{y}({\bf{x};{\bf{\theta}}})-y)^2 \tag{3.10}

  • This loss function is 0 if \hat{y}({\bf{x}};\theta)=y and grows quadratically as the difference between y and the prediction increases.
  • The corresponding cost function for the linear regression model can be written using matrix notation as:

J(θ)=1ni=1n(y^(xi;θ)yi)2=1ny^y22=1nXθy22=1nϵ22(3.11)\begin{align*}J(\theta)&=\frac{1}{n} \sum_{i=1}^{n} (\hat{y}({\bf{x}}_i ; \theta) - y_i)^2 \\ &= \frac{1}{n} ||\hat{\bf{y}} - {\bf{y}} ||_2^2 \\ &= \frac{1}{n} ||{\bf{X}}\theta-{\bf{y}}||_2^2\\ &= \frac{1}{n} ||\epsilon||_2^2\end{align*}\tag{3.11}

  • Where 2||\cdot||_2 denotes Euclidean vector norm, and 22||\cdot||_2^2 denotes its square.
  • This function is known as the least squares cost
  • When using the squared error loss for learning a linear regression model from T\mathcal{T}, we need to solve (Eq 3.12)


θ^=argminθ1ni=1n(θTxiyi)2=argminθ1nXθy22\begin{align*}\hat{\theta} &= \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n (\theta^T {\bf{x}}_i - y_i)^2 \\ &=\arg\min_\theta \frac{1}{n} || {\bf{X}}\theta-{\bf{y}}||_2^2 \tag{3.12}\end{align*}

  • In linear-algebra terms, we want to find the vector closest to {\bf{y}} in a Euclidean sense within the subspace of \mathbb{R}^n spanned by the columns of {\bf{X}}. The minimiser satisfies:

XTXθ^=XTy(3.13){\bf{X}}^T {\bf{X}}\hat{\theta} = {\bf{X}}^T {\bf{y}}\tag{3.13}

Figure 2 - Graphical Representation of Squared Error Loss Function.

  • Goal is to choose the model (blue line) such that the sum of the squares (denoted in light red) of each error \varepsilon is minimised.
  • Equation 3.13 is often referred to as the normal equation and gives the solution to the least squares problem (eq 3.12).
  • If XTX{\bf{X}}^T\bf{X} is invertible (which is often the case), then θ^\hat{\bf{\theta}} has the closed form expression:

θ^=(XTX)1XTy(3.14)\hat{\bf{\theta}} = ({\bf{X}}^T{\bf{X}})^{-1} {\bf{X}}^T y \tag{3.14}

  • The fact that this closed-form solution exists is important and is probably the reason for why linear regression + squared error loss is so common in practice.
  • Other loss functions lead to optimisation problems that often lack closed-form solutions.

Goal Learn linear regression using squared error loss.

Data Training data \mathcal{T} = \lbrace {\bf{x}}_i , y_i\rbrace_{i=1}^{n}

Result Learned parameter vector \hat{\bf{\theta}}

  1. Construct the matrix X\bf{X} and vector y\bf{y} according to Eq 3.5
  2. Compute θ^\hat{\bf{\theta}} by solving Eq 3.13

Goal Predict with linear regression

Data Learned parameter vector \hat{\bf{\theta}} and test input {\bf{x}}_\star

Result Prediction \hat{y}({\bf{x}}_\star)

  1. Compute y^(x)=θ^Tx\hat{y}({\bf{x_ \star}}) = \hat{\bf{\theta}}^T \bf{x_ \star}
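The two pseudocode boxes above can be written out directly in numpy; the training data below is invented for illustration:

```python
import numpy as np

# Training data (Eq 3.5): each row of X is [1, x_i1], y holds the outputs
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])

# Learn: solve the normal equations X^T X theta = X^T y (Eq 3.13)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Predict: y_hat(x_star) = theta_hat^T x_star (Eq 3.4)
x_star = np.array([1.0, 5.0])  # test input with the leading constant 1
y_hat = theta_hat @ x_star
```

Using `np.linalg.solve` on Eq 3.13 is numerically preferable to explicitly forming the inverse in Eq 3.14.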

Maximum Likelihood Perspective (Derivation of Least-Squares)

  • We can get another perspective on the least squares method as a maximum likelihood solution
  • In this context, the word likelihood refers to the statistical concept of the likelihood function
    • Maximising the likelihood function amounts to finding the value of θ\theta that makes observing yy as likely as possible.
  • That is, instead of arbitrarily selecting a loss function, we start with the problem:

θ^=argmaxθp(yX;θ)(3.15)\hat{\theta} = \arg \max_\theta p({\bf{y}} | {\bf{X}} ; \theta)\tag{3.15}

  • Here p(yX;θ)p({\bf{y}} | {\bf{X}}; \theta) is the probability density of all observed outputs y\bf{y} in the training data, given all inputs X\bf{X} and parameters θ\theta.

  • In defining what “likely” means mathematically, we consider the noise term ε\varepsilon as a stochastic variable with certain distribution

  • A common assumption is that the noise terms are independent, each with a Gaussian (normal) distribution with mean zero and variance σε2\sigma_\varepsilon^2

\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2) \tag{3.16}

Stochastic Having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.

  • This implies that the nn observed data points are independent, and p(yX;θ)p({\bf{y}} | {\bf{X}} ; \theta) factorises as:

p({\bf{y}} | {\bf{X}};\theta)=\prod_{i=1}^n p(y_i | {\bf{x}}_i, \theta)\tag{3.17}

  • Considering the linear regression model from (3.3), y=\theta^T {\bf{x}} + \varepsilon, together with the Gaussian noise assumption (3.16) yields the following:

p(yixi,θ)=N(yi;θTxi,σε2)=12πσε2exp(12σε2(θTxiyi)2)(3.18)\begin{align*} p(y_i | {\bf{x}}_i, \theta)&=\mathcal{N}(y_i;\theta^T {\bf{x}}_i, \sigma_\varepsilon^2)\\ &=\frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}} \exp({-\frac{1}{2\sigma_\varepsilon^2} (\theta^T {\bf{x}_i} - y_i)^2}) \end{align*}\tag{3.18}

  • Recall that we want to maximise the likelihood with respect to θ\theta. For numerical reasons, it is usually better to work with the logarithm of p(yX;θ)p({\bf{y}} | {\bf{X}}; \theta).

lnp(yX;θ)=i=1nlnp(yixi,θ)(3.19)\ln p ({\bf{y}} | {\bf{X}} ; \theta) = \sum_{i=1}^n \ln p (y_i | {\bf{x}}_i, \theta) \tag{3.19}

  • Since the logarithm is a monotonically increasing function, maximising the log-likelihood (Equation 3.19) is equivalent to maximising the likelihood itself. Combining Eq 3.18 and 3.19 yields:
\ln p({\bf{y}} | {\bf{X}} ; \theta) = -\frac{n}{2}\ln(2\pi\sigma_\varepsilon^2) - \frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (\theta^T {\bf{x}}_i-y_i)^2 \tag{3.20}
  • Removing the terms and factors independent of \theta does not change the maximising argument, so we can re-write 3.15 as:
\begin{align*}
\hat{\theta} 
&=\arg\max_\theta -\sum_{i=1}^n(\theta^T {\bf{x}}_i-y_i)^2\\
&=\arg\min_\theta\frac{1}{n} \sum_{i=1}^n (\theta^T {\bf{x}}_i - y_i)^2\tag{3.21}
\end{align*}
  • We have just derived the equation for linear regression with the least-squares cost (the cost function implied by the squared error loss function, Equation 3.10).
  • Hence, using the squared error loss is equivalent to assuming a Gaussian (normal) noise distribution in the maximum likelihood formulation
  • Other assumptions on ε\varepsilon lead to other loss functions.
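This equivalence is easy to check numerically: gradient descent on the Gaussian negative log-likelihood (whose gradient, after dropping constants, is proportional to X^T(Xθ − y)) lands on the same parameters as the closed-form least-squares solution. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50)  # Gaussian noise

# Closed-form least-squares solution via the normal equations (Eq 3.13)
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the negative log-likelihood; its gradient is
# (1/n) X^T (X theta - y), exactly the least-squares gradient
theta = np.zeros(2)
for _ in range(10_000):
    theta -= 0.01 * X.T @ (X @ theta - y) / len(y)
```

Both routes agree to numerical precision, as Eq 3.21 predicts.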

Categorical Input Variables

  • The regression problem is characterised by a numerical output y and inputs {\bf{x}} of arbitrary type
  • We can handle categorical inputs by first assuming that we have an input variable that takes one of two values.
  • We refer to these two values as \text{A} and \text{B} respectively, and then create a dummy variable x as:
x=\begin{cases}
0&\text{if A}\\
1&\text{if B}\\
\end{cases}\tag{3.22}
  • We use this variable in any supervised machine learning method as if it was numerical.
  • For linear regression, this effectively gives us a model which looks like:
y=\theta_0 + \theta_1x + \varepsilon=
\begin{cases}
\theta_0+\varepsilon&\text{if A}\\
\theta_0 + \theta_1 + \varepsilon&\text{if B}\\
\end{cases}\tag{3.23}
  • The model is thus able to learn and predict two different values depending on whether the input is \text{A} or \text{B}.
  • If the categorical variable takes more than two values, say {A, B, C, D}\{\text{A, B, C, D}\} we can make a so-called one-hot encoding by constructing a four-dimensional vector:
{\bf{x}} 
  = \begin{bmatrix}
    x_A & x_B & x_C & x_D
    \end{bmatrix}^T
\tag{3.24}
  • In which xA=1x_A=1 if A\text{A} and xB=1x_B=1 if B\text{B} and so on
    • Only one element of {\bf{x}} will be 1, with the rest being 0
    • This can be used for any type of supervised machine learning methods, not just linear regression
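A minimal sketch of the one-hot construction in Eq 3.24 (`one_hot` is an illustrative helper name, not from the text):

```python
import numpy as np

categories = ["A", "B", "C", "D"]

def one_hot(value):
    # Returns a len(categories)-dimensional vector with a single 1
    # in the position corresponding to `value` (Eq 3.24)
    x = np.zeros(len(categories))
    x[categories.index(value)] = 1.0
    return x

x_c = one_hot("C")  # [0, 0, 1, 0]
```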

Classification and Logistic Regression

  • Can modify the linear regression model to apply it to classification problems
  • This comes at the cost of losing the convenient normal equations; we must resort to numerical optimisation / iterative learning algorithms instead

Statistical View of Classification

  • Supervised Machine Learning amounts to predicting the output from the input
  • Classification amounts to predicting conditional class probabilities, as in Eq 3.25
p ( y = m | {\bf{x}})\tag{3.25}
  • p(y=mx)p(y=m|{\bf{x}}) describes the probability for class m given that we know the input x\bf{x}.
  • The notation p(y|{\bf{x}}) implies that we think about the class label y as a random variable.
  • This is because we choose to model the real world, where the data originates, as involving a certain amount of randomness (much like the random error \varepsilon in regression).

Example - Voting Behaviour using Probabilities

  • Want to construct a model that can predict voting preferences for different population groups.
  • Have to face the fact that not everyone in a certain population group will vote for the same political party.
  • Can therefore think of yy as a random variable which follows a certain probability distribution
  • If we know that, in the group of 45-year-old women {\bf{x}}, the vote share is 13% for the cerise party, 39% for the turquoise party and 48% for the purple party, we could describe it as:
\begin{align*}
  p(y=\text{cerise party} | {\bf{x}}=\text{45 year old women}) = 0.13\\
  p(y=\text{turquoise party} | {\bf{x}}=\text{45 year old women}) = 0.39\\
  p(y=\text{purple party} | {\bf{x}}=\text{45 year old women}) = 0.48
\end{align*}
  • In this way the probabilities p(y|{\bf{x}}) describe the non-trivial facts that:
    1. All 45 year old women do not vote for the same party,
    2. The choice of party does not appear to be completely random among 45 year old women either - the purple party is the most popular, and the cerise party is the least popular.
  • Thus, it is useful to have a classifier which predicts not only a class \hat{y} but a distribution over classes p(y | {\bf{x}})

  • From the above example, we see the utility of predicting a probability distribution instead of just a single class.
  • For binary classification problems (M=2M=2) where y{1,1}y\in\{-1, 1\} we train a model g(x)g({\bf{x}}) for which:
p(y=1 | {\bf{x}}) \text{ is modelled by } g({\bf{x}}) \tag{3.26a}
  • By the laws of probabilities, it holds that p(y=1x)+p(y=1x)=1p(y=1| {\bf{x}}) + p(y=-1 | {\bf{x}}) = 1 which means that
p(y=-1|{\bf{x}})\text{ is modelled by } 1-g({\bf{x}}) \tag{3.26b}
  • Since g(x)g({\bf{x}}) is modelled for a probability, it is natural to require that 0g(x)10\le g({\bf{x}}) \le 1 for any x\bf{x}.
  • For the multi-class problem, we instead let the classifier return a vector-valued function g(x)\bf{g(x)} where:

[p(y=1x)p(y=2x)p(y=Mx)] is modelled by [g1(x)g2(x)gM(x)]=g(x)(3.27)\begin{bmatrix} p(y=1|{\bf{x}})\\ p(y=2|{\bf{x}})\\ \vdots\\ p(y=M|{\bf{x}})\\ \end{bmatrix} \text{ is modelled by } \begin{bmatrix} g_1({\bf{x}})\\ g_2({\bf{x}})\\ \vdots\\ g_M({\bf{x}}) \end{bmatrix} =\bf{g(x)}\tag{3.27}

  • Each element gm(x)g_m({\bf{x}}) of g(x)\bf{g(x)} corresponds to the conditional class probability p(y=mx)p(y=m|{\bf{x}}).
  • Since g(x)\bf{g(x)} models a probability vector, we require that each element gm(x)0g_m(\bf{x}) \ge 0 and g(x)1=m=1Mgm(x)=1||\bf{g(x)}||_1 =\sum_{m=1}^M |g_m(\bf{x})|=1 for any x\bf{x}.

Logistic Regression Model for Binary Classification

  • Logistic Regression is a modification of the linear regression model so that it fits the classification problem.
  • Begin with the case for binary classification, in which we wish to learn a function g(x)g({\bf{x}}) that approximates the conditional probability of the positive class.
  • We begin with the linear regression model, which is given by the following equation (with noise term removed):

z=θ0+θ1x1+θ2x2++θpxp=θTx(3.28)z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p =\theta^T \bf{x}\tag{3.28}

  • This is a mapping that takes x\bf{x} as input, and returns zz, which in this context is called the logit
  • Note that zRz\in\mathbb{R}, whereas we need a function which instead returns a value in the interval [0,1][0,1].
  • The key idea is to use linear regression, and to squeeze zz from Eq 3.28 to the interval [0,1][0,1] using the logistic function h(z)=ez1+ezh(z)=\frac{e^z}{1+e^z} which is plotted in the figure below.

Figure 3 - Plot of the logistic function, given by h(z)=\frac{e^z}{1+e^z}

  • Substituting the logistic function into our equation for zz gives:

 

g(x)=eθTx1+eθTx(3.29a)g({\bf{x}}) = \frac{e^{\theta^T {\bf{x}}}}{1 + e^{\theta^T {\bf{x}}}}\tag{3.29a}

  • Equation 3.29a is restricted to [0,1][0,1] and hence can be interpreted as a probability.
  • The function 3.29a is the logistic regression model for p(y=1x)p(y=1|{\bf{x}}).
  • Note that this equation also implicitly gives a model for p(y=1x)p(y=-1 | {\bf{x}}).

p(y=1x)=1g(x)=1eθTx1+eθTx=11+eθTx=eθTx1+eθTx(3.29b)p(y=-1 | {\bf{x}}) = 1 - g({\bf{x}}) = 1 - \frac{e^{\theta^T {\bf{x}}}}{1 + e^{\theta^T {\bf{x}}}} = \frac{1}{1+e^{\theta^T {\bf{x}}}} = \frac{e^{-\theta^T {\bf{x}}}}{1 + e^{-\theta^T {\bf{x}}}} \tag{3.29b}

  • Fundamentally, the logistic regression model is the linear regression model with the logistic function applied to scale the output at the end.
  • Since logistic regression is a model for classification, we omit the noise term \varepsilon.
  • The randomness in the classification data is modelled by the class probability construction p(y=m | {\bf{x}}) instead of additive noise \varepsilon
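Equations 3.28 and 3.29a/b translate directly into code; the parameter values here are made up for illustration:

```python
import numpy as np

def g(x, theta):
    # Logistic regression model for p(y = 1 | x) (Eq 3.29a):
    # compute the logit z = theta^T x (Eq 3.28), then squash it into [0, 1]
    z = theta @ x
    return np.exp(z) / (1.0 + np.exp(z))

theta_hat = np.array([0.5, -1.0])  # illustrative learned parameters
x = np.array([1.0, 2.0])           # input [1, x_1] with the leading constant
p_pos = g(x, theta_hat)            # p(y = +1 | x)
p_neg = 1.0 - p_pos                # p(y = -1 | x) (Eq 3.29b)
```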

Training Logistic Regression Model by Maximum Likelihood

  • Whilst the addition of the logistic function allows us to use linear regression for binary classification, it means that we cannot use the normal equations for learning θ\theta due to the non-linearity of the logistic function
  • Therefore, to train the model, we learn from the training data T={xi,yi}i=1n\mathcal{T}=\lbrace {\bf{x}}_i,y_i\rbrace_{i=1}^n using the maximum likelihood approach
  • Using this Maximum Likelihood approach, learning a classifier amounts to solving the following equation:

θ^=argmaxθp(yX;θ)=argmaxθi=1nlnp(yixi;θ)(3.30)\hat{\theta}=\arg\max_\theta p({\bf{y}} | {\bf{X}} ; \theta) = \arg\max_\theta \sum_{i=1}^n \ln p(y_i | {\bf{x}}_i; \theta)\tag{3.30}

  • Equation 3.30 is similar to the training solution for linear regression, in which we assume that all training data-points are independent, and we consider the logarithm of the likelihood function for numerical reasons
  • Added θ\theta explicitly to the notation to emphasise dependence on model parameters.
  • Our model of p(y=1x;θ)p(y=1|{\bf{x}};\theta) is g(x;θ)g({\bf{x}};\theta), which means that we can re-write the log-likelihood component as follows:

lnp(yixi;θ)={lng(xi;θ)if yi=1ln(1g(xi;θ))if yi=1(3.31)\ln p(y_i|{\bf{x}_i};\theta)= \begin{cases} \ln g({\bf{x}}_i;\theta)&\text{if }y_i=1\\ \ln(1-g({\bf{x}}_i;\theta))&\text{if }y_i=-1 \end{cases}\tag{3.31}

  • We can turn the maximisation problem into a minimisation problem by using the negative log-likelihood as a cost function.  

\begin{align*} J(\theta) &=-\frac{1}{n}\sum_{i=1}^{n}\ln p(y_i|{\bf{x}}_i;\theta)\\ &=\frac{1}{n}\sum_{i=1}^{n} \underbrace{ \begin{cases} -\ln g({\bf{x}}_i;\theta)&\text{if }y_i=1\\ -\ln (1-g({\bf{x}}_i;\theta))&\text{if }y_i=-1\\ \end{cases}}_{\text{Binary cross-entropy loss }\mathcal{L}(g({\bf{x}}_i;\theta),y_i)} \end{align*}\tag{3.32}

  • Note here that the loss that we have derived is the cross-entropy loss.
  • For the logistic regression model, we can re-write the cost function given in (Eq 3.32) in greater detail.
    • We consider the case where the binary classes are labelled as {1,1}\lbrace-1,1\rbrace.
  • For yi=1y_i=1, we write:

g(xi;θ)=eθTxi1+eθTxi=eyiθTxi1+eyiθTxi(3.33a)g({\bf{x}}_i;\theta) =\frac {e^{\theta^T{\bf{x}_i}}} {1+e^{\theta^T{\bf{x}_i}}} =\frac {e^{y_i \theta^T {\bf{x}_i}}} {1+e^{y_i \theta^T {\bf{x}_i}}} \tag{3.33a}

  • And similarly for yi=1y_i=-1, we write:

g(xi;θ)=eθTxi1+eθTxi=eyiθTxi1+eyiθTxi(3.33b)g({\bf{x}}_i;\theta) =\frac {e^{-\theta^T{\bf{x}_i}}} {1+e^{-\theta^T{\bf{x}_i}}} =\frac {e^{y_i \theta^T {\bf{x}_i}}} {1+e^{y_i \theta^T {\bf{x}_i}}} \tag{3.33b}

  • Observe how in both cases we get the same expression
  • Therefore, we can write (Eq 3.32) compactly as:

\begin{align*} J(\theta)&=\frac{1}{n} \sum_{i=1}^{n}-\ln\frac{e^{y_i \theta^T {\bf{x}}_i}}{1+e^{y_i \theta^T {\bf{x}}_i}} =\frac{1}{n}\sum_{i=1}^{n}-\ln \frac{1}{1 + e^{-y_i \theta^T {\bf{x}}_i}}\\ &=\frac{1}{n}\sum_{i=1}^{n} \underbrace{\ln (1 + e^{-y_i \theta^T {\bf{x}}_i})} _{\text{Logistic loss } \mathcal{L}({\bf{x}}_i, y_i, \theta)} \end{align*} \tag{3.34}

  • The loss function L(x,yi,θ)\mathcal{L}({\bf{x}}, y_i, \theta) shown above (which is a special case of cross-entropy loss) is called the logistic loss (or binomial deviance).
  • Learning a logistic regression model then amounts to solving the following equation to find the optimal \hat\theta

\hat\theta=\arg\min_\theta\frac{1}{n}\sum_{i=1}^{n}\ln(1+e^{-y_i \theta^T {\bf{x}}_i})\tag{3.35}

  • Unlike linear regression, logistic regression has no closed-form solution, so we must resort to numerical optimisation instead.
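A minimal numerical-optimisation sketch for Eq 3.35 using plain gradient descent (the data, step size, and iteration count are illustrative; real implementations typically use Newton-type solvers):

```python
import numpy as np

# Illustrative separable data; rows of X are [1, x_i1], labels in {-1, +1}
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.5], [1.0, 2.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta = np.zeros(2)
for _ in range(2000):
    # Gradient of (1/n) sum ln(1 + exp(-y_i theta^T x_i)) w.r.t. theta:
    # -(1/n) sum y_i x_i sigmoid(-y_i theta^T x_i)
    margins = y * (X @ theta)
    grad = -(X.T @ (y * sigmoid(-margins))) / len(y)
    theta -= 0.5 * grad

preds = np.where(X @ theta > 0, 1, -1)  # r = 0.5 corresponds to logit 0
```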

Logistic Regression Predictions and Decision Boundaries

Thus far have developed a method for predicting the probabilities for each class for some test input x\bf{x}_\star.

  • Sometimes we just want the best class prediction the model can make without consideration for the class probabilities.
  • This is achieved by adding a final step to the logistic regression model which converts the predicted probabilities into class prediction
    • The most common approach is to let y^(x)\hat{y}({\bf{x}_\star}) be the most probable class
    • For the binary classification problem, this can be expressed as shown below.
    • Note that the decision at exactly g({\bf{x}}_\star)=r is arbitrary; with r=0.5 the prediction is simply the most probable class.

\hat{y}({\bf{x}}_\star)= \begin{cases} 1 & \text{if } g({\bf{x}}_\star)>r\\ -1 & \text{if }g({\bf{x}}_\star)\le r \end{cases}\tag{3.36}

Logistic Regression Pseudocode

Learn binary logistic regression

Data: Training data T={xi,yi}i=1n\mathcal{T}=\lbrace{\bf{x}}_i, y_i\rbrace_{i=1}^{n} with output classes y={1,1}y=\lbrace-1,1\rbrace

Result Learned parameter vector θ^\hat\theta

  1. Compute θ^\hat{\theta} by solving (Eq 3.35) numerically

Predict with binary logistic regression

Data Learned parameter vector θ^\hat\theta and test input x\bf{x}_\star

Result Prediction y^(x)\hat{y}({\bf{x}_\star})

  1. Compute g(x)g({\bf{x}_\star}) using (Eq. 3.29)
  2. If g(x)>0.5g({\bf{x}_\star})>0.5 then return y^(x)=1\hat{y}({\bf{x}_\star})=1 else return y^(x)=1\hat{y}({\bf{x}_\star})=-1
  • In the general case, the decision threshold r=0.5 is appropriate.
    • However, in some applications it can be beneficial to explore different thresholds
    • r=0.5 minimises the misclassification rate; however, the misclassification rate is not always the most important aspect of a classifier
  • For example, it can be preferable to mistake a healthy patient as sick (false positive) rather than to predict a sick patient as healthy (false negative).
  • The addition of class prediction probabilities essentially allows the models to have an “I don’t know” option

Logistic Regression with Multiple Classes

  • We can generalise logistic regression to the multi-class problem, where M>2M>2.

  • One approach is to use the softmax function

  • For the binary classification problem we designed a model which return a single scalar value representing p(y=1x)p(y=1|{\bf{x}}).

  • For the multi-class problem, we have to instead return a vector-valued function whose elements are non-negative and sum to 1.

    • We initially use M instances of (Eq 3.28, the linear regression logit), denoting each instance z_m; each has its own set of parameters \theta_m, so z_m=\theta^T_m{\bf{x}}.
    • We stack all instances of zmz_m into a vector of logits z=[z1z2zM]T{\bf{z}}=\begin{bmatrix}z_1&z_2&\cdots&z_M\end{bmatrix}^T and use the soft-max function to replace the logistic function
    • Essentially, use soft-max on the vector of logits instead of the logistic function on each model.

 \text{softmax}({\bf{z}})\overset{\Delta}{=} \frac{1}{\sum_{m=1}^{M} e^{z_m}} \begin{bmatrix} e^{z_1} & e^{z_2} & \cdots & e^{z_M} \end{bmatrix}^T \tag{3.41}

  • The softmax function takes in an M-dimensional vector and returns a vector of the same dimensionality.
    • The output vector from the softmax function always sums to 1, and each element is 0\ge 0.
  • We use the softmax function to model the class probabilities (similar to the use of the logistic function in the binary classification case).

g(x)=softmax(z),             where z=[θ1Txθ2TxθMTx](3.42){\bf{g}}({\bf{x}})=\text{softmax}({\bf{z}}), \ \ \ \ \ \ \ \ \ \ \ \ \ \text{where }\bf{z}= \begin{bmatrix} \theta_1^T{\bf{x}}\\ \theta_2^T{\bf{x}}\\ \vdots\\ \theta_M^T{\bf{x}}\\ \end{bmatrix} \tag{3.42}

  • We can also write out the individual class probabilities (i.e., elements of vector g(x)\bf{g(x)}) as:

g_m({\bf{x}})=\frac{e^{\theta^T_m {\bf{x}}}}{\sum_{j=1}^{M} e^{\theta^T_j {\bf{x}}}}, \qquad m=1, \cdots, M \tag{3.43}

  • Note that this construction uses MM parameter vectors θ1,,θM\theta_1, \cdots, \theta_M, which means that the number of parameters that need to be learned grows as MM increases.
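Eq 3.41 in code. Subtracting the largest logit before exponentiating is a standard numerical-stability trick; it cancels in the ratio, so the output is unchanged:

```python
import numpy as np

def softmax(z):
    # Eq 3.41: exponentiate each logit and normalise so the result sums to 1
    e = np.exp(z - np.max(z))  # shift for numerical stability (no effect on result)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs is non-negative and sums to 1, as a probability vector must
```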

Training Logistic Regression with Multiple Classes

  • As with the binary logistic regression model, we can use the concept of maximum likelihood to train our model.
  • For this, we will use θ\theta to denote all model parameters, θ={θ1,,θM}\theta=\lbrace\theta_1,\cdots,\theta_M\rbrace.
  • Since gm(xi;θ)g_m({\bf{x}}_i; \theta) is our model for p(yi=mxi)p(y_i=m|{\bf{x}}_i), the cost function for the cross-entropy loss for the multi-class problem is given as:

 J(\theta)=\frac{1}{n}\sum_{i=1}^{n} \underbrace{-\ln g_{y_i} ({\bf{x}}_i;\theta)}_{\text{Multi-class cross-entropy loss}}\tag{3.44}

  • Note that this multi-class cross-entropy loss is denoted L(g(xi;θ),yi)\mathcal{L}({\bf{g}}({\bf{x}}_i;\theta),y_i)
  • We can insert the model we developed in (Eq 3.43) into the loss function (Eq 3.44) to give the cost function to optimise for the multi-class logistic regression problem.

J(θ)=1ni=1n(θyiTxi+lnj=1MeθjTxi)(3.45)J(\theta)=\frac{1}{n}\sum_{i=1}^{n}(-\theta_{y_i}^T {\bf{x}}_i + \ln \sum_{j=1}^{M} e^{\theta_j^T{\bf{x}}_i})\tag{3.45}
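Eq 3.45 can be evaluated directly. A useful sanity check (on invented data, with classes indexed 0..M−1): with all parameters zero every class gets probability 1/M, so the cost is ln M:

```python
import numpy as np

# Invented data: n = 3 points, p = 1 input, M = 3 classes (indexed 0..2)
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 2.0]])
y = np.array([0, 1, 2])
Theta = np.zeros((3, 2))  # one parameter vector theta_m per row

def cost(Theta, X, y):
    Z = X @ Theta.T  # logits z_m = theta_m^T x_i, one row per data point
    # Eq 3.45: J = (1/n) sum_i ( -theta_{y_i}^T x_i + ln sum_j exp(theta_j^T x_i) )
    return float(np.mean(-Z[np.arange(len(y)), y] + np.log(np.exp(Z).sum(axis=1))))

J0 = cost(Theta, X, y)  # equals ln 3 when Theta is all zeros
```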

Polynomial Regression and Regularisation

  • Linear and logistic regression may appear rigid and inflexible compared to k-NN and decision trees, as they are built purely from straight lines (hyperplanes)

  • However, both models are able to adapt to the training data well if the input dimension pp is large relative to the number of data points nn.

  • We can increase the input dimension by performing a non-linear transformation of the input.

  • A simple way of doing this is to replace a one-dimensional input xx with itself raised to different powers (which turns this into polynomial regression)

    • The same can be done for the logit in logistic regression.

y=θ0+θ1x+θ2x2+θ3x3++ε(3.46)y=\theta_0 + \theta_1x + \theta_2x^2 + \theta_3 x^3 + \cdots + \varepsilon \tag{3.46}

  • Consider that in the original linear regression model, if we let x_1=x, x_2=x^2, x_3=x^3, this is still a linear model, with input {\bf{x}}=\begin{bmatrix}1&x&x^2&x^3\end{bmatrix}^T

    • However, in doing this, the input dimensionality has increased from p=1p=1 to p=3p=3.
  • Non-linear input transformations can be very useful, but they effectively increase p and can easily overfit the model to noise.

  • In the car stopping example, we take our original data and add a 1 for the offset term and x^2 for the second-order term to produce a new matrix:

X=[14.016.014.924.015.025.0139.61568.2139.71576.1],           θ=[θ0θ1θ2],            y=[4.08.08.0134.0110.0](3.47)\begin{align*} X=\begin{bmatrix} 1 & 4.0 & 16.0 \\ 1 & 4.9 & 24.0 \\ 1 & 5.0 & 25.0 \\ \vdots & \vdots & \vdots \\ 1 & 39.6 & 1568.2 \\ 1 & 39.7 & 1576.1 \\ \end{bmatrix}, \ \ \ \ \ \ \ \ \ \ \ \theta = \begin{bmatrix}\theta_0\\\theta_1\\\theta_2\end{bmatrix}, \ \ \ \ \ \ \ \ \ \ \ \ y = \begin{bmatrix} 4.0 \\ 8.0 \\ 8.0 \\ \vdots \\ 134.0 \\ 110.0 \end{bmatrix} \end{align*} \tag{3.47}

Figure 4 - 

  Learning the car stopping distance with linear regression, second-order polynomial regression, and 10th-order polynomial regression. From this, we can see that the 10th-degree polynomial is overfitting to outliers in the data, making it less useful than even ordinary linear regression (blue).
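The transformation in Eq 3.46/3.47 reduces polynomial regression to ordinary least squares on an expanded input matrix. A sketch on invented noise-free data generated from y = 1 + x², where the normal equations should recover θ ≈ [1, 0, 1]:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + x ** 2  # noise-free outputs from a known second-order model

# Second-order design matrix with columns [1, x, x^2] (cf. Eq 3.47)
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Reuse the ordinary normal equations (Eq 3.13) on the expanded inputs
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```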

Regularisation

  • One way to avoid the over-fitting of the model is to carefully select which inputs (transformations) to include.

    • Forward Selection Strategy Add one input at a time
    • Backward Elimination Strategy Remove inputs that are considered to be redundant
  • We can additionally evaluate different candidate models, and compare using cross validation (discussed in Chapter 4.)

  • Additionally, we can perform regularisation, which is the idea of keeping the parameters θ^\hat\theta small unless the data really convinces us otherwise.

  • There are different ways to implement this mathematically, which result in different regularisation techniques.

L2 Regularisation

  • To keep θ^\hat\theta small, an extra penalty term λθ22\lambda ||\theta ||_2^2 is added to the cost function when using L^2 regularisation

  • Here, λ0\lambda\ge0 is referred to as the regularisation parameter (a hyperparameter chosen by the user, which controls the strength of regularisation effect).

    • This penalty term counteracts overfitting
    • The original cost function only rewards fit to the training data; the regularisation term prevents overly large parameter values at the cost of a slightly worse fit
    • It is therefore important to choose the value of the regularisation parameter \lambda wisely.
    • \lambda=0 has no regularisation effect, whilst \lambda\rightarrow\infty will force all parameters \hat\theta to 0.
  • Adding L^2 regularisation to the linear regression model with squared error loss (Eq 3.12) yields the following equation

 \hat\theta=\arg\min_\theta\frac{1}{n} || {\bf{X}}\theta-{\bf{y}}||_2^2 + \lambda ||\theta||_2^2 \tag{3.48}

  • Just like the non-regularised problem, Eq 3.48 has a closed-form solution given by a modified version of the normal equations:

({\bf{X}}^T {\bf{X}} + n\lambda {\bf{I}}_{p+1})\hat\theta={\bf{X}}^T {\bf{y}}\tag{3.49}

  • Note that in this equation {\bf{I}}_{p+1} is the identity matrix of size (p+1)\times(p+1).

  • This particular application of regularisation is referred to as ridge regularisation.

  • The concept of regularisation is not limited to linear regression

    • The same L2L^2 penalty can be applied to any method that involves the optimisation of a cost function
  • For example, here is the L2L^2 regularisation applied to logistic regression

θ^=argminθ1ni=1nln(1+exp(yiθTxi))+λθ22(3.50)\hat\theta=\arg\min_\theta\frac{1}{n}\sum_{i=1}^{n} \ln (1 + \exp(-y_i \theta^T {\bf{x}}_i)) + \lambda || \theta||_2^2\tag{3.50}

  • Logistic regression is commonly trained using Eq. 3.50 instead of Eq 3.35.
    • This reduces possible issues with overfitting
    • Additionally, without L^2 regularisation the parameters can diverge during training on some datasets (for example, when the classes are linearly separable).
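A sketch of ridge regression via the modified normal equations (Eq 3.49), on synthetic data. Setting λ = 0 recovers the ordinary least-squares solution, while a large λ shrinks the parameters towards zero:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([0.5, 2.0]) + 0.1 * rng.normal(size=20)

def ridge(X, y, lam):
    n, d = X.shape
    # Eq 3.49: (X^T X + n * lambda * I_{p+1}) theta = X^T y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

theta_unreg = ridge(X, y, 0.0)   # lambda = 0: ordinary least squares
theta_reg = ridge(X, y, 10.0)    # strongly regularised: parameters shrink
```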

Generalised Linear Models

This section has been omitted as it is not part of the assessable content for this course