Lindholm Chapter 8

Course: Machine Learning
Semester: S1 2023


Non-Linear Input Transformations and Kernels

Creating Features by Non-Linear Input Transformations

  • We can make use of arbitrary non-linear transformations of the original input values in any model, including linear regression.
  • For a one-dimensional input, the linear regression model is given as

$$ y=\theta_0 + \theta_1 x + \varepsilon \tag{8.1} $$

  • We can extend this model with $x^2, x^3, \dots, x^{d-1}$ as inputs (where $d$ is a user choice) and thus obtain a linear regression model which is a polynomial in $x$:

$$ y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_{d-1} x^{d-1} + \varepsilon = \theta^T \phi(x) + \varepsilon \tag{8.2} $$

  • Since $x$ is known, we can directly compute $x^2,\dots,x^{d-1}$.
  • Note that this is still a linear regression model, since the parameters $\theta$ enter linearly with $\phi(x)=\begin{bmatrix}1 & x & x^2 & \dots & x^{d-1}\end{bmatrix}^T$ as a new input vector.
  • We refer to a transformation of ${\bf x}$ as a feature, and the $d\times 1$ vector of transformed inputs $\phi({\bf x})$ as a feature vector.
  • The parameters $\hat\theta$ are still learned in the same way, but we:

$$ \text{replace the original } {\bf X} = \underbrace{\begin{bmatrix}{\bf x}_1^T \\ {\bf x}_2^T \\ \vdots \\ {\bf x}_n^T \end{bmatrix}}_{n\times (p+1)} \text{ with the transformed } {\bf\Phi}({\bf X}) = \underbrace{\begin{bmatrix}\phi({\bf x}_1)^T \\ \phi({\bf x}_2)^T \\ \vdots \\ \phi({\bf x}_n)^T \end{bmatrix}}_{n\times d} $$

  • The idea of non-linear input transformations is not limited to linear regression, and any choice of non-linear transformation $\phi(\cdot)$ can be used with any supervised machine learning technique.

  • The non-linear transformation is applied to the input as a pre-processing step, and the transformed input is then used when training, evaluating and using the model.

  • Polynomials are only one out of infinitely many possible choices of features $\phi({\bf x})$.

    • Polynomials higher than second order must be used carefully in practice - they grow rapidly outside the observed input range.
  • There are several alternatives that are often more useful, such as the Fourier series, essentially corresponding to:

ϕ(x)=[1sin(x)cos(x)sin(2x)cos(2x)]T\phi (x) = \begin{bmatrix}1 & \sin(x) & \cos(x) & \sin(2x) & \cos(2x) & \cdots \end{bmatrix}^T

  • The use of non-linear input transformations $\phi(x)$ arguably makes simple models more flexible and applicable to real-world problems with non-linear characteristics.

  • In order to obtain good performance, it is important to choose $\phi(x)$ so that enough flexibility is obtained while overfitting is avoided.

  • Explore the idea of letting the number of features $d\rightarrow\infty$ and combining this with regularisation.

  • In a sense, this will automate the choice of features, and it leads us to a family of powerful off-the-shelf machine learning tools called kernel methods.
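As a concrete illustration, here is a minimal NumPy sketch (variable names are illustrative) of linear regression with polynomial features $\phi(x) = \begin{bmatrix}1 & x & \dots & x^{d-1}\end{bmatrix}^T$ used as a pre-processing step:

```python
import numpy as np

def poly_features(x, d):
    """Feature vectors phi(x) = [1, x, x^2, ..., x^(d-1)] for scalar inputs x."""
    return np.vander(np.asarray(x, dtype=float), N=d, increasing=True)  # shape (n, d)

# Toy data from a non-linear function
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

Phi = poly_features(x, d=4)                          # transformed inputs Phi(X)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # still ordinary least squares

y_pred = poly_features(np.array([0.5]), d=4) @ theta_hat  # prediction at x = 0.5
```

Swapping `poly_features` for any other transformation (e.g. Fourier features) changes nothing else in the training procedure, since the model is still linear in $\theta$.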

Kernel Ridge Regression

Ridge Regression = Linear Regression + L2 Regularisation
Kernel Ridge Regression = Ridge Regression using Kernels
Using kernels allows L2-regularised Linear Regression to use non-linear input transformations

  • A carefully engineered transformation may work for a specific machine learning problem, but it is not a general solution.
  • We would like $\phi({\bf x})$ to contain many transformations that could possibly be of interest for most problems - to obtain a general off-the-shelf method.
  • Explore the idea of letting $d\rightarrow\infty$.

Re-Formulating Linear Regression

  • To let $d\rightarrow\infty$ we have to use some kind of regularisation to avoid overfitting when $d>n$.
  • The optimisation for L2-regularised linear regression is given as:

$$ \hat\theta=\arg\min_\theta \frac 1n \sum_{i=1}^{n} (\underbrace{\theta^T \phi({\bf x}_i)}_{\hat y({\bf x}_i)} - y_i)^2 + \lambda || \theta ||_2^2 = ({\bf\Phi}({\bf X})^T {\bf \Phi}({\bf X}) + n \lambda {\bf I})^{-1} {\bf \Phi}({\bf X})^T {\bf y} \tag{8.4a} $$

  • At this point, non-linear transformations haven’t been constrained to anything specific yet.
  • The downside of choosing $d$ (as a fixed value) is that we also have to learn $d$ parameters when training.
  • In linear regression, we usually first learn and store the $d$-dimensional vector $\hat\theta$ and thereafter use it for computing predictions:

$$ \hat y({\bf x_\star}) = \hat\theta^T \phi({\bf x_\star}) \tag{8.5} $$

  • To be able to choose large $d$ we have to re-formulate the model such that there are no computations or storage demands that scale with $d$.
  • To do this, first realise that we can re-formulate the prediction $\hat y({\bf x_\star})$ as:

$$ \begin{align*} \hat y({\bf x_\star}) = \underbrace{\hat\theta^T}_{1\times d}\, \underbrace{\phi({\bf x_\star})}_{d\times 1} &= \left( ({\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X}) + n\lambda {\bf I})^{-1} {\bf\Phi}({\bf X})^T {\bf y} \right)^T \phi({\bf x_\star}) \\ &=\underbrace{{\bf y}^T}_{1\times n} \underbrace{ \underbrace{{\bf\Phi}({\bf X})}_{n\times d} \underbrace{({\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X}) + n\lambda {\bf I})^{-1}}_{d\times d} \underbrace{\phi({\bf x_\star})}_{d\times 1} }_{n\times 1} \tag{8.6} \end{align*} $$

  • This expression for $\hat y({\bf x_\star})$ suggests that instead of computing and storing the $d$-dimensional $\hat\theta$ once (independently of ${\bf x_\star}$), we could compute the $n$-dimensional vector ${\bf\Phi}({\bf X})({\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X}) + n\lambda {\bf I})^{-1}\phi({\bf x_\star})$ for each test point.
  • By doing so, we avoid storing a $d$-dimensional vector.
  • However, this still requires inverting a $d\times d$ matrix, which is computationally intensive.
  • The push-through matrix identity says that ${\bf A}({\bf A}^T{\bf A}+{\bf I})^{-1} = ({\bf A}{\bf A}^T + {\bf I})^{-1} {\bf A}$ holds for any matrix ${\bf A}$.
  • Using it in the above equation, we can further re-write the prediction as:

$$ \hat y({\bf x_\star}) = \underbrace{{\bf y}^T}_{1\times n} \underbrace{({\bf\Phi}({\bf X}) {\bf\Phi}({\bf X})^T + n \lambda {\bf I})^{-1}}_{n\times n} \underbrace{{\bf\Phi}({\bf X}) \phi({\bf x_\star})}_{n\times 1} \tag{8.7} $$
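The push-through identity (here with the scaled identity matrix $n\lambda{\bf I}$, for which it also holds) is easy to check numerically; a small NumPy sketch with arbitrary sizes:

```python
import numpy as np

# Check A (A^T A + c I_d)^{-1} = (A A^T + c I_n)^{-1} A for a random A
rng = np.random.default_rng(1)
n, d, c = 5, 8, 0.3
A = rng.standard_normal((n, d))

left = A @ np.linalg.inv(A.T @ A + c * np.eye(d))   # inverts a d x d matrix
right = np.linalg.inv(A @ A.T + c * np.eye(n)) @ A  # inverts an n x n matrix

assert np.allclose(left, right)  # identity holds, so we may invert n x n instead of d x d
```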

  • It appears as if we can compute $\hat y({\bf x_\star})$ without having to deal with any $d$-dimensional vector or matrix, provided that the matrix products ${\bf\Phi}({\bf X}) {\bf\Phi}({\bf X})^T$ and ${\bf\Phi}({\bf X})\phi({\bf x_\star})$ can be computed.

$$ {\bf\Phi}({\bf X}) {\bf\Phi}({\bf X})^T = \begin{bmatrix} \phi({\bf x}_1)^T \phi({\bf x}_1) & \phi({\bf x}_1)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_1)^T \phi({\bf x}_n) \\ \phi({\bf x}_2)^T \phi({\bf x}_1) & \phi({\bf x}_2)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_2)^T \phi({\bf x}_n) \\ \vdots & & \ddots & \vdots \\ \phi({\bf x}_n)^T \phi({\bf x}_1) & \phi({\bf x}_n)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_n)^T \phi({\bf x}_n) \end{bmatrix} \tag{8.8} $$

$$ {\bf\Phi}({\bf X})\phi({\bf x_\star})=\begin{bmatrix} \phi({\bf x}_1)^T \phi({\bf x_\star}) \\ \phi({\bf x}_2)^T \phi({\bf x_\star}) \\ \vdots \\ \phi({\bf x}_n)^T \phi({\bf x_\star}) \end{bmatrix} \tag{8.9} $$

  • Remember that $\phi({\bf x})^T \phi({\bf x}')$ is the inner product between the two $d$-dimensional vectors $\phi({\bf x})$ and $\phi({\bf x}')$.
  • Note that the transformed inputs $\phi({\bf x})$ enter into Equation 8.7 only as inner products, where each inner product is a scalar.
  • That is, if we are able to compute the inner product directly without first explicitly computing the $d$-dimensional $\phi({\bf x})$ vectors, we can avoid the $d$-dimensional computations and storage - we have reached our goal.

  • Consider the case for polynomials - if $p=1$ (meaning that ${\bf X}$ is a scalar $x$) and $\phi(x)$ is a third-order polynomial ($d=4$) with the second and third terms scaled by $\sqrt 3$, we have:

$$ \phi(x)^T \phi(x') = \begin{bmatrix} 1 & \sqrt 3\, x & \sqrt 3\, x^2 & x^3 \end{bmatrix} \begin{bmatrix} 1 \\ \sqrt 3\, x' \\ \sqrt 3\, x'^2 \\ x'^3 \end{bmatrix} = 1 + 3xx' + 3x^2 x'^2 + x^3 x'^3 = (1 + xx')^3 \tag{8.10} $$

  • It can be shown that if $\phi(x)$ is a suitably re-scaled polynomial of order $d-1$, then $\phi(x)^T\phi(x')=(1+xx')^{d-1}$.
  • Instead of computing the two $d$-dimensional vectors $\phi(x), \phi(x')$ and then computing the inner product, we could just evaluate the expression $(1+xx')^{d-1}$ directly.
  • Consider the computational scaling in a situation where $d$ is in the hundreds or thousands.

If we just make a choice of $\phi({\bf x})$ such that the inner product $\phi({\bf x})^T \phi({\bf x}')$ can be computed without first computing $\phi({\bf x})$, we can let $d$ be arbitrarily big.

  • Such a non-linear transformation can be quite difficult to create and derive, but we can bypass this using the concept of a kernel.
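For the third-order polynomial case in Equation 8.10, the equality between the explicit inner product and the direct expression can be verified numerically; a minimal sketch:

```python
import numpy as np

def phi(x):
    """Re-scaled third-order polynomial features (d = 4), as in Equation 8.10."""
    return np.array([1.0, np.sqrt(3) * x, np.sqrt(3) * x**2, x**3])

x, xp = 0.7, -1.3
inner = phi(x) @ phi(xp)    # explicit d-dimensional inner product
direct = (1 + x * xp) ** 3  # evaluated directly, no features needed

assert np.isclose(inner, direct)
```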

Kernel

  • A kernel $\kappa({\bf x}, {\bf x}')$ is any function that takes in two arguments from the same space and returns a scalar.
    • Limit the choice of kernel to kernels that are real-valued and symmetric: $\kappa({\bf x}, {\bf x}') = \kappa({\bf x}', {\bf x})\in\mathbb{R}$
  • The “kernel” that we used before - the inner product of the input transformations - is an example of a kernel:

$$ \kappa({\bf x}, {\bf x}')=\phi({\bf x})^T \phi({\bf x}') \tag{8.11} $$

  • Since $\phi({\bf x})$ only appears in the linear regression model via inner products, we do not have to design a $d$-dimensional vector $\phi({\bf x})$ and derive its inner product.
    • Instead, we can just use the inner product directly. This is known as the kernel trick:

If ${\bf x}$ enters the model as $\phi({\bf x})^T\phi({\bf x}')$ only, we can choose a kernel $\kappa({\bf x}, {\bf x}')$ instead of choosing $\phi({\bf x})$.

  • To be clear, we can write Equation 8.7 (linear regression using $\phi({\bf x})$) using the kernel:

$$ \begin{align*} \hat y({\bf x_\star}) &= \underbrace{{\bf y}^T}_{1\times n} \underbrace{({\bf K}({\bf X},{\bf X}) + n\lambda {\bf I})^{-1}}_{n\times n} \underbrace{{\bf K}({\bf X}, {\bf x_\star})}_{n\times 1} \tag{8.12a}\\ \text{where } {\bf K}({\bf X},{\bf X})&=\begin{bmatrix} \kappa({\bf x}_1, {\bf x}_1) & \kappa({\bf x}_1, {\bf x}_2) & \cdots & \kappa({\bf x}_1, {\bf x}_n)\\ \kappa({\bf x}_2, {\bf x}_1) & \kappa({\bf x}_2, {\bf x}_2) & \cdots & \kappa({\bf x}_2, {\bf x}_n)\\ \vdots & & \ddots & \vdots \\ \kappa({\bf x}_n, {\bf x}_1) & \kappa({\bf x}_n, {\bf x}_2) & \cdots & \kappa({\bf x}_n, {\bf x}_n) \end{bmatrix}\tag{8.12b}\\ {\bf K}({\bf X},{\bf x_\star}) &= \begin{bmatrix} \kappa({\bf x}_1, {\bf x_\star}) \\ \kappa({\bf x}_2, {\bf x_\star}) \\ \vdots \\ \kappa({\bf x}_n, {\bf x_\star}) \end{bmatrix}\tag{8.12c} \end{align*} $$

  • These equations describe linear regression with $L^2$ regularisation using a kernel $\kappa({\bf x}, {\bf x}')$.
  • Since $L^2$-regularised linear regression is also called ridge regression, we refer to Equation 8.12 as kernel ridge regression.
  • In principle, we may choose the kernel $\kappa({\bf x}, {\bf x}')$ arbitrarily, just as long as we can compute Equation 8.12a.
    • Requires that the inverse of ${\bf K}({\bf X},{\bf X}) + n \lambda {\bf I}$ exists - guaranteed for positive semidefinite kernels.
  • Positive semidefinite kernels include the squared-exponential kernel (also called the RBF, exponentiated quadratic or Gaussian kernel), shown below in Equation 8.13.
    • In this kernel function, $\ell>0$ is a design choice left to the user.

$$ \kappa({\bf x}, {\bf x}')=\exp\left(-\frac{||{\bf x}-{\bf x}'||_2^2}{2\ell^2}\right) \tag{8.13} $$

  • Another kernel function is the polynomial kernel:

$$ \kappa({\bf x},{\bf x}') = (c + {\bf x}^T {\bf x}')^{d-1} $$

  • And this kernel has the special case of the linear kernel:

$$ \kappa({\bf x},{\bf x}') = c + {\bf x}^T {\bf x}' $$
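All three kernels above can be evaluated directly, without ever forming $\phi({\bf x})$; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def squared_exponential(x, xp, ell=1.0):
    """Squared-exponential (RBF / Gaussian) kernel, Equation 8.13."""
    diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * ell**2))

def polynomial(x, xp, c=1.0, d=4):
    """Polynomial kernel of order d - 1."""
    return (c + np.dot(x, xp)) ** (d - 1)

def linear(x, xp, c=1.0):
    """Linear kernel: the polynomial kernel with d = 2."""
    return c + np.dot(x, xp)
```

Each function is real-valued and symmetric in its two arguments, as required of a kernel.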

  • From Equation 8.12, it may seem as if we have to compute the inverse of ${\bf K}({\bf X},{\bf X}) + n \lambda {\bf I}$ every time we want to make a prediction.

    • Not necessary, since this matrix does not depend on the test point ${\bf x_\star}$.

    • Therefore, we can introduce the $n$-dimensional vector $\hat\alpha$ such that

      $$ \hat\alpha=\begin{bmatrix}\hat\alpha_1\\\hat\alpha_2\\\vdots\\\hat\alpha_n\end{bmatrix}=({\bf K}({\bf X},{\bf X}) + n\lambda {\bf I})^{-1}{\bf y} \tag{8.14a} $$

    • This then allows us to re-write kernel ridge regression as:

      $$ \hat{y}({\bf x_\star}) = \hat\alpha^T {\bf K}({\bf X},{\bf x_\star})\tag{8.14b} $$

    • This means that we don't need to store the $d$-dimensional vector $\hat\theta$, but we do need to store the $n$-dimensional vector $\hat\alpha$ and ${\bf X}$ (since we need to compute ${\bf K}({\bf X},{\bf x_\star})$).

  • The use of kernel ridge regression is summarised as:

Learn Kernel Ridge Regression

Data: Training data $\mathcal{T}=\{{\bf x}_i, y_i\}_{i=1}^{n}$ and kernel $\kappa$
Result: Learned dual parameters $\hat\alpha$

  1. Compute $\hat\alpha$ as per Equation 8.14a

Predict with Kernel Ridge Regression

Data: Learned dual parameters $\hat\alpha$, training inputs ${\bf X}$ and test input ${\bf x_\star}$

  1. Compute $\hat y({\bf x_\star})$ as per Equation 8.14b
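The learn and predict steps above can be sketched in NumPy for 1-D inputs (the squared-exponential kernel and all names here are illustrative choices):

```python
import numpy as np

def kernel(x, xp, ell=0.2):
    """Squared-exponential kernel, Equation 8.13, for scalar inputs."""
    return np.exp(-((x - xp) ** 2) / (2.0 * ell**2))

def learn_krr(X, y, lam):
    """Equation 8.14a: alpha_hat = (K(X, X) + n*lam*I)^{-1} y."""
    n = X.shape[0]
    K = kernel(X[:, None], X[None, :])  # n x n Gram matrix K(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_krr(alpha_hat, X, x_star):
    """Equation 8.14b: y_hat(x_star) = alpha_hat^T K(X, x_star)."""
    return alpha_hat @ kernel(X, x_star)

# Toy 1-D regression problem
rng = np.random.default_rng(2)
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(X.size)

alpha_hat = learn_krr(X, y, lam=1e-3)
y_star = predict_krr(alpha_hat, X, 0.25)  # prediction at x_star = 0.25
```

Note that nothing $d$-dimensional is ever computed: only the $n\times n$ Gram matrix and the $n$-vector ${\bf K}({\bf X},{\bf x_\star})$ appear.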

Support Vector Regression

A version of Kernel Ridge Regression, but with a different loss function

Representer Theorem

  • Interpret Equation 8.14 as a dual formulation of linear regression, where we have dual parameters $\alpha$ instead of primal parameters $\theta$.
    • Here “primal” and “dual” are just two different ways of expressing the same linear regression idea.
  • Comparing Equations 8.14b and 8.5, we have:

$$ \hat y ({\bf x_\star})=\hat\theta^T \phi({\bf x_\star}) = \hat\alpha^T \underbrace{{\bf \Phi}({\bf X}) \phi({\bf x_\star})}_{{\bf K}({\bf X},{\bf x_\star})}\tag{8.15} $$

  • This suggests that:

$$ \hat\theta={\bf\Phi}({\bf X})^T \hat\alpha \tag{8.16} $$

  • This relationship between $\theta$ and $\alpha$ is not specific to kernel ridge regression, but a consequence of a general result called the representer theorem.
    • It holds whenever $\theta$ is learned using almost any loss function together with L2 regularisation.
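For a finite feature vector, Equation 8.16 can be checked numerically by computing the primal and dual ridge solutions separately; a sketch assuming polynomial features:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 15, 4, 0.1
x = rng.uniform(-1, 1, n)
y = rng.standard_normal(n)

Phi = np.vander(x, N=d, increasing=True)  # n x d feature matrix Phi(X)

# Primal: theta_hat = (Phi^T Phi + n*lam*I)^{-1} Phi^T y   (Equation 8.4a)
theta_hat = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(d), Phi.T @ y)

# Dual: alpha_hat = (Phi Phi^T + n*lam*I)^{-1} y           (Equation 8.14a with K = Phi Phi^T)
alpha_hat = np.linalg.solve(Phi @ Phi.T + n * lam * np.eye(n), y)

# Representer theorem, Equation 8.16: theta_hat = Phi(X)^T alpha_hat
assert np.allclose(theta_hat, Phi.T @ alpha_hat)
```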

Support Vector Regression

The same as Kernel Ridge Regression, but using $\epsilon$-insensitive loss.

  • The use of $\epsilon$-insensitive loss causes the elements of $\hat\alpha$ to become sparse (some elements become zero).
  • The training points corresponding to the non-zero elements of $\hat\alpha$ are referred to as **support vectors**.
  • The prediction $\hat y ({\bf x_\star})$ will only depend on the support vectors.
  • The $\epsilon$-insensitive loss is given as:

$$ L(y,\hat y) = \begin{cases}0 & \text{if } |y-\hat y| < \epsilon, \\ |y - \hat y| - \epsilon & \text{otherwise}\end{cases} \tag{8.17} $$

  • The parameter $\epsilon$ is a user choice.
  • Support Vector Regression makes use of the linear regression model:

$$ \hat y({\bf x_\star}) = \theta^T \phi({\bf x_\star})\tag{8.18a} $$

  • Instead of using the least-squares cost function as in Equation 8.4, we now have:

$$ \hat\theta=\arg\min_\theta \frac 1n \sum_{i=1}^{n} \max \{0, | y_i - \underbrace{\theta^T \phi({\bf x}_i)}_{\hat y({\bf x}_i)}| - \epsilon\} + \lambda || \theta ||_2^2 \tag{8.18b} $$

  • Reformulate the primal formulation in Equation 8.18 into a dual formulation (from using $\theta$ to using $\alpha$).
  • Cannot use the closed-form derivation - the dual formulation becomes:

$$ \hat y ({\bf x_\star}) = \hat\alpha^T {\bf K}({\bf X}, {\bf x_\star}) \tag{8.19a} $$

  • Here $\hat\alpha$ is the solution to the optimisation problem:

$$ \hat\alpha = \arg\min_\alpha \frac 12 \alpha^T {\bf K}({\bf X}, {\bf X})\alpha-\alpha^T{\bf y} + \epsilon ||\alpha||_1\tag{8.19b} $$

  • Equation 8.19b is the equivalent of Equation 8.14a for Kernel Ridge Regression, and is a consequence of the representer theorem.
  • A larger value of $\epsilon$ will result in fewer support vectors.
    • The number of support vectors is also affected by $\lambda$, since $\lambda$ influences the shape of $\hat y({\bf x_\star})$.
  • All data is used at training time (for solving Equation 8.19b) but only the support vectors contribute toward the prediction.
  • Equation 8.19b is a constrained optimisation problem (constrained by Equation 8.19c).
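The $\epsilon$-insensitive loss of Equation 8.17 is straightforward to implement; a minimal sketch:

```python
import numpy as np

def eps_insensitive_loss(y, y_hat, eps=0.5):
    """Equation 8.17: zero inside the eps-tube, absolute error minus eps outside."""
    r = np.abs(y - y_hat)
    return np.where(r < eps, 0.0, r - eps)

# Residuals of 0.1 and 0.2 fall inside the tube (zero loss); 2.0 costs 2.0 - 0.5
losses = eps_insensitive_loss(np.array([1.0, 1.2, 3.0]), np.array([1.1, 1.0, 1.0]))
```

Errors smaller than $\epsilon$ cost nothing, which is what produces the sparsity in $\hat\alpha$ and hence the support vectors.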

Support Vector Classification

  • Possible to derive a kernel version of logistic regression with L2 regularisation.
  • Consider the binary classification problem $y\in\{-1,1\}$ and start with the margin formulation of the logistic regression classifier:

$$ \hat y({\bf x_\star}) = \text{sign} \{\theta^T \phi({\bf x_\star})\} \tag{8.32} $$

  • If we were to learn $\theta$ using the logistic loss, we would obtain logistic regression with non-linear feature transformation $\phi({\bf x})$, from which kernel logistic regression would eventually follow.
  • Instead, make use of the hinge loss function:

$$ L({\bf x}, y, \theta) = \max\{0, 1-y \theta^T \phi({\bf x})\} = \begin{cases} 1 - y\theta^T \phi({\bf x}) & \text{if } y\theta^T\phi({\bf x}) < 1 \\ 0 & \text{otherwise} \end{cases}\tag{8.33} $$

  • Analogously to the $\epsilon$-insensitive loss, the motivation for the hinge loss comes from looking at the dual formulation using $\alpha$.
  • First, consider the primal formulation with L2 regularisation:

$$ \hat\theta = \arg\min_\theta \frac 1n \sum_{i=1}^{n} \max\{0, 1-y_i \theta^T \phi({\bf x}_i)\} + \lambda ||\theta||_2^2 \tag{8.34} $$

  • Cannot use the kernel trick directly, as the feature vector does not appear as $\phi({\bf x})^T\phi({\bf x}')$; however, we can derive the following dual formulation:

$$ \begin{align*} \hat\alpha&=\arg\min_\alpha \frac 12 \alpha^T {\bf K}({\bf X},{\bf X})\alpha-\alpha^T {\bf y}\tag{8.35a}\\ \text{subject to } |\alpha_i| &\le \frac{1}{2n\lambda} \text{ and } 0 \le \alpha_i y_i \tag{8.35b}\\ \text{with } \hat y({\bf x_\star}) &= \text{sign}\left(\hat\alpha^T {\bf K}({\bf X}, {\bf x_\star})\right)\tag{8.35c} \end{align*} $$

  • Support Vector Classification can utilise different kernel functions which result in different decision boundaries

    Figure 1: The decision boundaries for support vector classification with linear kernel (left) and squared exponential kernel (right).

  • As a consequence of using hinge loss, support vector classification provides hard classifications.
    • Can use the squared hinge loss or the Huberised squared hinge loss, which allow for a probabilistic interpretation of the margin.
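For comparison with Equation 8.33, the hinge loss can be sketched as (names are illustrative):

```python
import numpy as np

def hinge_loss(y, margin):
    """Equation 8.33, with margin = theta^T phi(x): zero once y * margin >= 1."""
    return np.maximum(0.0, 1.0 - y * margin)

# A confidently correct prediction (margin 2) costs nothing;
# a weakly correct one (0.5) and a wrong one (-1) incur linear cost
losses = hinge_loss(np.array([1.0, 1.0, 1.0]), np.array([2.0, 0.5, -1.0]))
```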