COMP4702 Lecture 9

Course: Machine Learning
Semester: S1 2023


Chapter 8: Non-Linear Input Transformations and Kernels

Creating Features by Non-Linear Input Transformations

What if we perform some non-linear transformations of the data before passing it to the model?

The vanilla linear regression model is denoted as:

$$y = \theta_0 + \theta_1 x + \varepsilon \tag{8.1}$$

From this, we can extend the model with $x^2, x^3, \dots, x^{d-1}$ as additional inputs, where $d$ is another hyperparameter, to obtain the following model:

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_{d-1} x^{d-1} + \varepsilon = \theta^T \boldsymbol\phi(x) + \varepsilon$$

Since $x$ is known, we can directly compute the powers $x^2, x^3, \dots, x^{d-1}$. Note that this is still a linear regression model, since the parameters $\theta$ enter linearly, with $\boldsymbol\phi(x) = \begin{bmatrix}1 & x & x^2 & \cdots & x^{d-1}\end{bmatrix}^T$ as a new input vector, a vector of basis functions.

  • One of these transformations of $x$ is known as a feature, and the vector of transformed inputs is denoted $\boldsymbol\phi(x)$.
% Create our dataset
x = rand(20,1);
y = cos(x);
% Add some noise
y = y + 0.1*randn(20,1);
% Straight-line fit for comparison (p1 was undefined in the original notes;
% a degree-1 polyfit is assumed here)
p1 = polyfit(x, y, 1);
x1 = linspace(0,1);
y1 = polyval(p1, x1);
hold on;
plot(x1,y1);
% Transformed features x^2 and x^3
xsqd = x.^2;
xcub = x.^3;
z = [x xsqd xcub];

% Now do polynomial fitting on z
b = ones(20,1); % Bias column (the constant basis function)
z = [b z];
% Training step (normal equations), p3 are our theta values
p3 = (z'*z)\(z'*y);

% Evaluate the cubic model on a grid of test inputs
y3 = zeros(1,100);
for i = 1:100
    y3(i) = p3(1) + p3(2)*x1(i) + p3(3)*x1(i)^2 + p3(4)*x1(i)^3;
end
plot(x1,y3);

Kernel Ridge Regression

  • The most important idea in this chapter is the use of a kernel function $\kappa(\mathbf{x}, \mathbf{x}')$ between two datapoints $\mathbf{x}$ and $\mathbf{x}'$.

  • If we have a set of non-linear basis functions $\boldsymbol\phi(\mathbf{x})$, then a very useful kernel function is the inner product between the transformed pairs of datapoints:

$$\kappa(\mathbf{x}, \mathbf{x}') = \boldsymbol\phi(\mathbf{x})^T \boldsymbol\phi(\mathbf{x}')$$

  • From the example below, it is evident why adding the polynomial inputs allows models to achieve superior performance.

    • With the input space on the left, the red and blue datapoints are not distinguishable by a linear model.
    • However, by introducing $x_1 x_2$ as a third input feature, the data becomes (almost perfectly) linearly separable by a single plane.

Figure 1 - Motivation for introducing non-linear basis functions as an input.

  • We can express linear regression with the MSE loss and $L^2$ regularisation as follows, replacing $\mathbf{x}$ with $\boldsymbol\phi(\mathbf{x})$, since we want to use the basis-function transformation of each input variable.

$$\hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} \Big(\underbrace{\theta^T \boldsymbol\phi(\mathbf{x}_i)}_{\hat y(\mathbf{x}_i)} - y_i\Big)^2 + \lambda \|\theta\|_2^2 = \big(\boldsymbol\Phi(\mathbf{X})^T \boldsymbol\Phi(\mathbf{X}) + n\lambda\mathbf{I}\big)^{-1} \boldsymbol\Phi(\mathbf{X})^T \mathbf{y} \tag{8.4a}$$

  • We can re-write this equation as inner products of basis functions, yielding Equation 8.7 shown below.

$$\hat y(\mathbf{x}_\star) = \underbrace{\mathbf{y}^T}_{1\times n}\, \underbrace{\big(\boldsymbol\Phi(\mathbf{X})\boldsymbol\Phi(\mathbf{X})^T + n\lambda\mathbf{I}\big)^{-1}}_{n\times n}\, \underbrace{\boldsymbol\Phi(\mathbf{X})\boldsymbol\phi(\mathbf{x}_\star)}_{n\times 1} \tag{8.7}$$

  • For a training set $\mathbf{X}$, the matrix $\mathbf{K}(\mathbf{X}, \mathbf{X})$ is the kernel function applied to all pairs of datapoints.
    • $\mathbf{K}(\mathbf{X}, \mathbf{x}_\star)$ is a vector whose entries are the kernel function between a test point $\mathbf{x}_\star$ and each point in the training set.

    • This brings us to kernel ridge regression, where the predictions are given by Equation 8.14b, using $\hat\alpha$ from Equation 8.14a and the kernel function between the training set $\mathbf{X}$ and the test point $\mathbf{x}_\star$, denoted $\mathbf{K}(\mathbf{X}, \mathbf{x}_\star)$.

$$\hat\alpha = \begin{bmatrix}\hat\alpha_1\\ \hat\alpha_2\\ \vdots\\ \hat\alpha_n\end{bmatrix} = \big(\mathbf{K}(\mathbf{X}, \mathbf{X}) + n\lambda\mathbf{I}\big)^{-1} \mathbf{y} \tag{8.14a}$$

$$\hat{y}(\mathbf{x}_\star) = \hat\alpha^T \mathbf{K}(\mathbf{X}, \mathbf{x}_\star) \tag{8.14b}$$

    • Instead of computing a $d$-dimensional vector $\hat\theta$, we compute and store an $n$-dimensional vector $\hat\alpha$ (from Equation 8.14a) as well as $\mathbf{X}$.
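As a sanity check on Equations 8.4a, 8.14a and 8.14b, the NumPy sketch below (data and variable names are illustrative, not from the lecture) fits ridge regression both ways using a linear kernel $\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$, i.e. $\boldsymbol\phi(\mathbf{x}) = \mathbf{x}$, and confirms the primal and kernel formulations give the same prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 3
X = rng.normal(size=(n, d))          # training inputs; here phi(x) = x (linear kernel)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
lam = 0.1

# Primal ridge solution (Eq 8.4a): theta-hat = (X^T X + n*lambda*I)^-1 X^T y
theta = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Kernel form (Eq 8.14a): alpha-hat = (K(X,X) + n*lambda*I)^-1 y, with K = X X^T
K = X @ X.T
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

# Predictions at a test point (Eq 8.14b): y-hat = alpha^T K(X, x_star)
x_star = rng.normal(size=d)
pred_kernel = alpha @ (X @ x_star)
pred_primal = theta @ x_star
print(np.isclose(pred_kernel, pred_primal))  # → True
```

The agreement follows from the push-through identity $(\mathbf{X}^T\mathbf{X} + n\lambda\mathbf{I})^{-1}\mathbf{X}^T = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T + n\lambda\mathbf{I})^{-1}$.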

  • We require that $\mathbf{K}(\mathbf{X}, \mathbf{X}) + n\lambda\mathbf{I}$ is invertible; kernels that guarantee this are known as positive semidefinite kernels.
    • For a positive semidefinite kernel, adding $n\lambda\mathbf{I}$ with $\lambda > 0$ makes the matrix positive definite, and hence invertible.
  • One potential choice of the kernel function is the Squared Exponential Kernel / Gaussian kernel:

$$\kappa(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|_2^2}{2\ell^2}\right) \tag{8.13}$$

  • In Equation 8.13 above, $\ell > 0$ is a hyperparameter that is left to be chosen by the user.
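A minimal Python version of Equation 8.13 (the function name is my own):

```python
import numpy as np

def sq_exp_kernel(x, x_prime, ell=1.0):
    """Squared exponential (Gaussian) kernel, Eq 8.13; ell > 0 is the length-scale."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * ell ** 2))

print(sq_exp_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points → 1.0
print(sq_exp_kernel([0.0], [100.0]))          # distant points → effectively 0
```

Larger $\ell$ makes the kernel decay more slowly, so more distant training points influence the prediction.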

  • Kernel ridge regression predicts based on the value of the kernel function between the test point and all training points, weighted by the $\hat\alpha$ terms from Equation 8.14a.

  • We can see that when we obtain $\hat\alpha$ by adding “a bit of” the identity matrix to $\mathbf{K}(\mathbf{X}, \mathbf{X})$ before inverting and multiplying by the training targets $\mathbf{y}$, we are essentially adding a bit of noise to the diagonal of the kernel matrix.
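To see why this diagonal term matters: duplicate datapoints make the kernel (Gram) matrix singular, while adding $n\lambda\mathbf{I}$ restores invertibility. A small NumPy illustration with a linear kernel (the data is made up):

```python
import numpy as np

# Two identical datapoints make the linear-kernel Gram matrix rank deficient
X = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [3.0, 0.0]])
K = X @ X.T
n, lam = X.shape[0], 0.1

print(np.linalg.matrix_rank(K))                        # 2 < n: K alone is singular
print(np.linalg.matrix_rank(K + n * lam * np.eye(n)))  # 3: regularised matrix is invertible
```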

  • How does this work?

    • Start by writing down the loss function, its solution, and the prediction function for linear regression (with $L^2$ regularisation) in terms of the transformed data $\boldsymbol\phi(\mathbf{x})$ in place of $\mathbf{x}$.
      • We can do this as linear regression has a closed-form solution.
    • Use some matrix algebra to realise that the solution can be re-written so that $\boldsymbol\phi(\mathbf{x})$ only appears in inner products $\boldsymbol\phi(\mathbf{x})^T \boldsymbol\phi(\mathbf{x}')$.
    • That inner product is our kernel function. Substituting the kernel function, we are still using the same model, loss function and prediction function.
  • Note that when we write linear regression in the previous form (Equation 8.5), we have a $d$-dimensional parameter vector $\theta$ no matter how big $n$ (the size of the training set) is.

  • Now, we compute the kernel function for all pairs of datapoints, so the size of our “model” scales with $n$, not $d$.

  • We can even use kernels that correspond to $d \rightarrow \infty$.

Matlab Example

% Create our dataset
x = 5 * rand(50,1);
y = x.^2 + 2*randn(50,1); % Our function is y = x^2 (+ noise)

plot(x,y,'.');
% Number of training points
n = 50;           
% Hyperparameter, weight of regularisation in loss function
lambda = 0.01;    
% Kernel function hyperparameter
l = 5;            
d = pdist(x); % Euclidean distance between datapoints
% Evaluate kernel function
k = exp(-(d.^2)/(2*l^2));
K = squareform(k); % Turn into square matrix
alpha = (K + n*lambda*eye(n)) \ y; % alpha from Equation 8.14a (column vector)
% Prediction
xtest = 0:0.1:5;
dtest = pdist2(xtest',x);
ktest = exp(-(dtest.^2)/(2*l^2));
ytest = ktest*alpha;
hold on;
plot(xtest,ytest,'k');

Support Vector Regression

  • Support Vector Regression means changing the loss function from that used in kernel ridge regression to the $\epsilon$-insensitive loss (Equation 8.17).

$$L(y, \hat y) = \begin{cases} 0, & \text{if } |y - \hat y| < \epsilon \\ |y - \hat y| - \epsilon, & \text{otherwise} \end{cases} = \max(0, |y - \hat y| - \epsilon) \tag{8.17}$$

    • This no longer has a closed-form solution, so we require a numerical optimiser to solve it.
    • However, the problem is a convex, constrained optimisation problem, so there are likely fewer issues than with neural network training.
    • Optimising this loss function leads to many of the $\alpha_i$ values becoming exactly zero.
      • This means that in the end, the trained model only depends on a small number of the datapoints to make its predictions.

Figure 2 - Support Vector Regression with Epsilon-insensitive loss. The points with non-zero alpha values are known as the support vectors and are highlighted here in red.

  • All datapoints inside the dashed lines don’t affect the prediction
    • The datapoints outside are known as the support vectors, and affect the prediction
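Equation 8.17 is straightforward to code; a minimal NumPy sketch (the function name and $\epsilon$ value are arbitrary):

```python
import numpy as np

def eps_insensitive_loss(y, y_hat, eps=0.5):
    """Epsilon-insensitive loss (Eq 8.17): zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

# Residuals smaller than eps incur no loss, so those points cannot become
# support vectors; larger residuals are penalised linearly.
print(eps_insensitive_loss(1.0, 1.2))  # inside the tube → 0.0
print(eps_insensitive_loss(1.0, 2.0))  # |1 - 2| - 0.5 → 0.5
```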

Kernel Theory

  • It turns out that several ML models can be re-written as kernel methods.
  • We can do this for k-NN by writing the (squared) Euclidean distance in terms of a linear kernel $\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$:

$$\|\mathbf{x} - \mathbf{x}'\|_2^2 = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{x}' + \mathbf{x}'^T\mathbf{x}' = \kappa(\mathbf{x}, \mathbf{x}) + \kappa(\mathbf{x}', \mathbf{x}') - 2\kappa(\mathbf{x}, \mathbf{x}')$$

  • One of the main reasons kernel methods became widely used in Machine Learning is that they allow you to apply ML techniques to data where a Euclidean distance cannot be defined (e.g. text snippets, as in Example 8.4).
    • They effectively allow you to “make up” your own kernel.
    • If you can create a sensible way to compare datapoints, then you can use kernel methods.

Example 8.4 - Kernel k-NN for Interpreting Words using Levenshtein Distance (Edit Distance)

  • With textual inputs, Euclidean distance has no meaning.
  • We can use the Levenshtein Distance (Edit Distance) to compare two strings.
    • This is the number of single-character edits required to transform one word (string) to the other.
  • LD returns a non-negative integer which is only zero if the two strings are equivalent.
    • This fulfils the properties of being a metric on the space of character strings.
  • Using LD, we can construct the kernel as:

$$\kappa(x, x') = \exp\left(-\frac{LD(x, x')^2}{2\ell^2}\right)$$

  • Consider a training set with 10 adjectives ($x_i$) and corresponding labels $y_i \in \{\text{Positive}, \text{Negative}\}$ according to their meaning.
  • Use a kernel $k$-NN with the kernel defined above to predict whether the word $x_\star = $ ‘horrendous’ is a positive or negative adjective.
| Word, $x_i$ | Meaning, $y_i$ | Levenshtein Distance, $LD(x_i, x_\star)$ | $\kappa(x_i, x_i) + \kappa(x_\star, x_\star) - 2\kappa(x_i, x_\star)$ |
|---|---|---|---|
| ‘Awesome’ | Positive | 8 | 1.44 |
| ‘Excellent’ | Positive | 10 | 1.73 |
| ‘Spotless’ | Positive | 9 | 1.60 |
| ‘Terrific’ | Positive | 8 | 1.44 |
| ‘Tremendous’ | Positive | 4 | 0.55 |
| ‘Awful’ | Negative | 9 | 1.60 |
| ‘Dreadful’ | Negative | 6 | 1.03 |
| ‘Horrific’ | Negative | 6 | 1.03 |
| ‘Outrageous’ | Negative | 6 | 1.03 |
| ‘Terrible’ | Negative | 8 | 1.44 |
  • Inspecting the rightmost column, the closest word to ‘horrendous’ is ‘tremendous’.
    • Thus, if we use $k=1$, the conclusion would be that ‘horrendous’ is a positive word.
  • However, the second, third and fourth closest words are all negative (‘dreadful’, ‘horrific’, ‘outrageous’).
    • Thus, if we use $k=3$ or $k=4$, the conclusion would be that ‘horrendous’ is a negative word.
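The table values can be reproduced with a few lines of Python. Since $\kappa(x, x) = 1$ for this kernel, the rightmost column reduces to $2 - 2\kappa(x_i, x_\star)$; a length-scale of $\ell = 5$ is inferred from the table values (it is not stated in the notes):

```python
import numpy as np

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def string_kernel(a, b, ell=5.0):
    """Gaussian kernel on edit distance; ell = 5 reproduces the table."""
    return np.exp(-levenshtein(a, b) ** 2 / (2 * ell ** 2))

x_star = "horrendous"
for word in ["tremendous", "dreadful", "awesome"]:
    dist_sq = 2 - 2 * string_kernel(word, x_star)  # kappa(x, x) = 1 here
    print(word, levenshtein(word, x_star), round(dist_sq, 2))
```

This prints LD 4 and distance 0.55 for ‘tremendous’, matching the table.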

Meaning of a Kernel

  • There is a lot of freedom in the choice of kernel function.

    • Many commonly used kernel functions are positive semidefinite, but for practical purposes this doesn’t matter (except for kernel ridge regression).
    • A number of example functions are discussed in the book.
  • The kernel defines how close (or similar) any two data points are.

    • If $\kappa(\mathbf{x}_i, \mathbf{x}_\star) > \kappa(\mathbf{x}_j, \mathbf{x}_\star)$ then $\mathbf{x}_\star$ is more similar to $\mathbf{x}_i$ than to $\mathbf{x}_j$.
    • For most methods, the prediction $\hat y(\mathbf{x}_\star)$ is most influenced by the training points that are closest or most similar to $\mathbf{x}_\star$.
  • Even though we started by introducing kernels via the inner product $\boldsymbol\phi(\mathbf{x})^T \boldsymbol\phi(\mathbf{x}')$, we do not need an inner product on the space in which $\mathbf{x}$ itself lives.

    • We can therefore apply a positive semidefinite kernel method to text strings without worrying about the inner product of strings, as long as we have a kernel for that type of data.

Valid Kernels

  • There are many examples of kernels, and the book goes into detail regarding what constitutes a valid kernel.
  • Examples of kernels include the linear kernel (Equation 8.23) and the polynomial kernel (Equation 8.24), in which the order of the polynomial is $d-1$.

$$\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}' \tag{8.23}$$

$$\kappa(\mathbf{x}, \mathbf{x}') = (c + \mathbf{x}^T \mathbf{x}')^{d-1} \tag{8.24}$$
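For intuition, with scalar inputs and $d-1=2$, the polynomial kernel corresponds to an explicit three-dimensional feature map $\phi(x) = \begin{bmatrix}c & \sqrt{2c}\,x & x^2\end{bmatrix}^T$, since $(c + xx')^2 = c^2 + 2c\,xx' + (xx')^2$. A quick NumPy check (the feature map is written out by hand):

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, degree=2):
    """Polynomial kernel (Eq 8.24) for scalar inputs."""
    return (c + x * xp) ** degree

def phi(x, c=1.0):
    """Explicit degree-2 feature map whose inner product equals the kernel."""
    return np.array([c, np.sqrt(2 * c) * x, x ** 2])

x, xp = 1.5, -0.7
print(np.isclose(poly_kernel(x, xp), phi(x) @ phi(xp)))  # → True
```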

Support Vector Classification

  • We can derive Support Vector Classification by applying to logistic regression the same argument we applied to linear regression to obtain Support Vector Regression.

  • If we use the hinge loss function for logistic regression (with transformed inputs, i.e. the output of the basis functions), we end up with a training problem that is a convex, constrained optimisation problem (which requires a numerical solver).

    • When this is solved, the solution again tends to be sparse (many $\alpha_i = 0$).
  • When the margin for a datapoint is $> 1$, then $\alpha_i = 0$, so in the end these datapoints don’t define the decision boundary.

  • When the margin for datapoint $i$ is $\le 1$, the point is inside the margin or on the wrong side of the decision boundary.

  • The classification model is given as:

$$\hat{y}(\mathbf{x}_\star) = \text{sign}\big(\hat\alpha^T \mathbf{K}(\mathbf{X}, \mathbf{x}_\star)\big) \tag{8.35c}$$
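The sparsity argument above can be seen directly from the hinge loss: any datapoint whose margin $y \cdot f(\mathbf{x})$ exceeds 1 contributes zero loss, which is why its $\alpha_i$ ends up exactly zero. A minimal sketch (the margin values are made up):

```python
import numpy as np

def hinge_loss(margin):
    """Hinge loss as a function of the margin y * f(x)."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.array([2.5, 1.0, 0.3, -0.4])
# Points with margin > 1 contribute zero loss (alpha_i = 0, not support vectors);
# points with margin <= 1 are penalised and become support vectors.
print(hinge_loss(margins))  # → [0.  0.  0.7 1.4]
```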

Figure 3 - SVM for Classification with Linear and Squared exponential kernel. The points in yellow are the support vectors.

  • Note that $\lambda > 0$ is a regularisation parameter that penalises points that are inside the margin.
  • In ML libraries, you will often find this parameter as $C = \frac{1}{\lambda}$, or possibly even $\nu$.
  • The effect of this can be seen along the horizontal direction of Figure 3 above.