Lindholm Chapter 8

Course: Machine Learning
Semester: S1 2023


Non-Linear Input Transformations and Kernels

Creating Features by Non-Linear Input Transformations

  • We can make use of arbitrary non-linear transformations of the original input values in any model, including linear regression.
  • For a one-dimensional input, the linear regression model is given as

$$ y=\theta_0 + \theta_1 x + \varepsilon \tag{8.1} $$

  • We can extend this model with $x^2, x^3, \dots, x^{d-1}$ as inputs (where $d$ is a user choice) and thus obtain a linear regression model which is a polynomial in $x$:

$$ y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_{d-1} x^{d-1} + \varepsilon = \theta^T \phi(x) + \varepsilon \tag{8.2} $$

  • Since $x$ is known, we can directly compute $x^2,\dots,x^{d-1}$.
  • Note that this is still a linear regression model, since the parameters $\theta$ enter linearly with $\phi(x)=\begin{bmatrix}1 & x & x^2 & \dots & x^{d-1}\end{bmatrix}^T$ as a new input vector.
  • We refer to a transformation of ${\bf x}$ as a feature, and the $d\times 1$ vector of transformed inputs $\phi({\bf x})$ as a feature vector.
  • The parameters $\hat\theta$ are still learned in the same way, but we:

$$ \text{replace the original } {\bf X} = \underbrace{\begin{bmatrix}{\bf x}_1^T \\ {\bf x}_2^T \\ \vdots \\ {\bf x}_n^T \end{bmatrix}}_{n\times (p+1)} \text{ with the transformed } {\bf\Phi}({\bf X}) = \underbrace{\begin{bmatrix}\phi({\bf x}_1)^T \\ \phi({\bf x}_2)^T \\ \vdots \\ \phi({\bf x}_n)^T \end{bmatrix}}_{n\times d} $$

  • The idea of non-linear input transformations is not limited to linear regression, and any choice of non-linear transformation $\phi(\cdot)$ can be used with any supervised machine learning technique.

  • The non-linear transformation is applied to the input as a pre-processing step, and the transformed input is then used when training, evaluating and using the model.

  • Polynomials are only one out of infinitely many possible choices of features $\phi({\bf x})$.

    • Polynomials higher than second order must be used carefully in practice - they grow rapidly outside the observed input range.
  • There are several alternatives that are often more useful, such as the Fourier series, essentially corresponding to:

ϕ(x)=[1sin(x)cos(x)sin(2x)cos(2x)]T\phi (x) = \begin{bmatrix}1 & \sin(x) & \cos(x) & \sin(2x) & \cos(2x) & \cdots \end{bmatrix}^T

  • The use of non-linear input transformations $\phi(x)$ arguably makes simple models more flexible and applicable to real-world problems with non-linear characteristics.

  • In order to obtain good performance, it is important to choose $\phi(x)$ so that enough flexibility is obtained while overfitting is avoided.

  • Explore the idea of letting the number of features $d\rightarrow\infty$ and combining this with regularisation.

  • In a sense, this will automate the choice of features, and it leads us to a family of powerful off-the-shelf machine learning tools called kernel methods.
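As a concrete illustration, here is a minimal NumPy sketch (variable names are illustrative) of linear regression with polynomial features $\phi(x) = \begin{bmatrix}1 & x & \dots & x^{d-1}\end{bmatrix}^T$ used as a pre-processing step:

```python
import numpy as np

def poly_features(x, d):
    """Feature vectors phi(x) = [1, x, x^2, ..., x^(d-1)] for scalar inputs x."""
    return np.vander(np.asarray(x, dtype=float), N=d, increasing=True)  # shape (n, d)

# Toy data from a non-linear function
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

Phi = poly_features(x, d=4)                          # transformed inputs Phi(X)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # still ordinary least squares

y_pred = poly_features(np.array([0.5]), d=4) @ theta_hat  # prediction at x = 0.5
```

Swapping `poly_features` for any other transformation (e.g. Fourier features) changes nothing else in the training procedure, since the model is still linear in $\theta$.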

Kernel Ridge Regression

Ridge Regression = Linear Regression + L2 Regularisation
Kernel Ridge Regression = Ridge Regression using Kernels
Using kernels allows L2-regularised Linear Regression to use non-linear input transformations

  • A carefully engineered transformation may work for a specific machine learning problem, but it is not a general solution.
  • We would like $\phi({\bf x})$ to contain many transformations that could possibly be of interest for most problems - to obtain a general off-the-shelf method.
  • Explore the idea of letting $d\rightarrow\infty$.

Re-Formulating Linear Regression

  • To let $d\rightarrow\infty$ we have to use some kind of regularisation to avoid overfitting when $d>n$.
  • The optimisation for L2-regularised linear regression is given as:

$$ \hat\theta=\arg\min_\theta \frac 1n \sum_{i=1}^{n} (\underbrace{\theta^T \phi({\bf x}_i)}_{\hat y({\bf x}_i)} - y_i)^2 + \lambda || \theta ||_2^2 = ({\bf\Phi}({\bf X})^T {\bf \Phi}({\bf X}) + n \lambda {\bf I})^{-1} {\bf \Phi}({\bf X})^T {\bf y} \tag{8.4a} $$

  • At this point, non-linear transformations haven’t been constrained to anything specific yet.
  • The downside of choosing $d$ (as a fixed value) is that we also have to learn $d$ parameters when training.
  • In linear regression, we usually first learn and store the $d$-dimensional vector $\hat\theta$ and thereafter use it for computing predictions:

$$ \hat y({\bf x_\star}) = \hat\theta^T \phi({\bf x_\star}) \tag{8.5} $$

  • To be able to choose large $d$ we have to re-formulate the model such that there are no computations or storage demands that scale with $d$.
  • To do this, first realise that we can re-formulate the prediction $\hat y({\bf x_\star})$ as:

$$ \begin{align*} \hat y({\bf x_\star}) = \underbrace{\hat\theta^T}_{1\times d}\, \underbrace{\phi({\bf x_\star})}_{d\times 1} &= \left( ({\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X}) + n\lambda {\bf I})^{-1} {\bf\Phi}({\bf X})^T {\bf y} \right)^T \phi({\bf x_\star}) \\ &=\underbrace{{\bf y}^T}_{1\times n} \underbrace{ \underbrace{{\bf\Phi}({\bf X})}_{n\times d} \underbrace{({\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X}) + n\lambda {\bf I})^{-1}}_{d\times d} \underbrace{\phi({\bf x_\star})}_{d\times 1} }_{n\times 1} \tag{8.6} \end{align*} $$

  • This expression for $\hat y({\bf x_\star})$ suggests that instead of computing and storing the $d$-dimensional $\hat\theta$ once (independently of ${\bf x_\star}$), we could compute the $n$-dimensional vector ${\bf\Phi}({\bf X})({\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X}) + n\lambda {\bf I})^{-1}\phi({\bf x_\star})$ for each test point.
  • By doing so, we avoid storing a $d$-dimensional vector.
  • However, this still requires inverting a $d\times d$ matrix, which is computationally intensive.
  • The push-through matrix identity says that ${\bf A}({\bf A}^T{\bf A}+{\bf I})^{-1} = ({\bf A}{\bf A}^T + {\bf I})^{-1} {\bf A}$ holds for any matrix ${\bf A}$.
  • Using it in the above equation, we can further re-write the prediction as:

$$ \hat y({\bf x_\star}) = \underbrace{{\bf y}^T}_{1\times n} \underbrace{({\bf\Phi}({\bf X}) {\bf\Phi}({\bf X})^T + n \lambda {\bf I})^{-1}}_{n\times n} \underbrace{{\bf\Phi}({\bf X}) \phi({\bf x_\star})}_{n\times 1} \tag{8.7} $$
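The push-through identity (here with the scaled identity matrix $n\lambda{\bf I}$, for which it also holds) is easy to check numerically; a small NumPy sketch with arbitrary sizes:

```python
import numpy as np

# Check A (A^T A + c I_d)^{-1} = (A A^T + c I_n)^{-1} A for a random A
rng = np.random.default_rng(1)
n, d, c = 5, 8, 0.3
A = rng.standard_normal((n, d))

left = A @ np.linalg.inv(A.T @ A + c * np.eye(d))   # inverts a d x d matrix
right = np.linalg.inv(A @ A.T + c * np.eye(n)) @ A  # inverts an n x n matrix

assert np.allclose(left, right)  # identity holds, so we may invert n x n instead of d x d
```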

  • It appears as if we can compute $\hat y({\bf x_\star})$ without having to deal with any $d$-dimensional vector or matrix, provided that the matrix products ${\bf\Phi}({\bf X}) {\bf\Phi}({\bf X})^T$ and ${\bf\Phi}({\bf X})\phi({\bf x_\star})$ can be computed.

$$ {\bf\Phi}({\bf X}) {\bf\Phi}({\bf X})^T = \begin{bmatrix} \phi({\bf x}_1)^T \phi({\bf x}_1) & \phi({\bf x}_1)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_1)^T \phi({\bf x}_n) \\ \phi({\bf x}_2)^T \phi({\bf x}_1) & \phi({\bf x}_2)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_2)^T \phi({\bf x}_n) \\ \vdots & & \ddots & \vdots \\ \phi({\bf x}_n)^T \phi({\bf x}_1) & \phi({\bf x}_n)^T \phi({\bf x}_2) & \cdots & \phi({\bf x}_n)^T \phi({\bf x}_n) \end{bmatrix} \tag{8.8} $$

$$ {\bf\Phi}({\bf X})\phi({\bf x_\star})=\begin{bmatrix} \phi({\bf x}_1)^T \phi({\bf x_\star}) \\ \phi({\bf x}_2)^T \phi({\bf x_\star}) \\ \vdots \\ \phi({\bf x}_n)^T \phi({\bf x_\star}) \end{bmatrix} \tag{8.9} $$

  • Remember that $\phi({\bf x})^T \phi({\bf x}')$ is the inner product between the two $d$-dimensional vectors $\phi({\bf x})$ and $\phi({\bf x}')$.
  • Note that the transformed inputs $\phi({\bf x})$ enter into Equation 8.7 only as inner products, where each inner product is a scalar.
  • That is, if we are able to compute the inner product directly without first explicitly computing the $d$-dimensional $\phi({\bf x})$ vectors, we can avoid the $d$-dimensional computations and storage - we have reached our goal.

  • Consider the case for polynomials - if $p=1$ (meaning that ${\bf X}$ is a scalar $x$) and $\phi(x)$ is a third-order polynomial ($d=4$) with the second and third terms scaled by $\sqrt 3$, we have:

$$ \phi(x)^T \phi(x') = \begin{bmatrix} 1 & \sqrt 3\, x & \sqrt 3\, x^2 & x^3 \end{bmatrix} \begin{bmatrix} 1 \\ \sqrt 3\, x' \\ \sqrt 3\, x'^2 \\ x'^3 \end{bmatrix} = 1 + 3xx' + 3x^2 x'^2 + x^3 x'^3 = (1 + xx')^3 \tag{8.10} $$

  • It can be shown that if $\phi(x)$ is a suitably re-scaled polynomial of order $d-1$, then $\phi(x)^T\phi(x')=(1+xx')^{d-1}$.
  • Instead of computing the two $d$-dimensional vectors $\phi(x), \phi(x')$ and then computing the inner product, we could just evaluate the expression $(1+xx')^{d-1}$ directly.
  • Consider the computational scaling in a situation where $d$ is in the hundreds or thousands.

If we just make a choice of $\phi({\bf x})$ such that the inner product $\phi({\bf x})^T \phi({\bf x}')$ can be computed without first computing $\phi({\bf x})$, we can let $d$ be arbitrarily big.

  • Such a non-linear transformation can be quite difficult to create and derive, but we can bypass this using the concept of a kernel.
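For the third-order polynomial case in Equation 8.10, the equality between the explicit inner product and the direct expression can be verified numerically; a minimal sketch:

```python
import numpy as np

def phi(x):
    """Re-scaled third-order polynomial features (d = 4), as in Equation 8.10."""
    return np.array([1.0, np.sqrt(3) * x, np.sqrt(3) * x**2, x**3])

x, xp = 0.7, -1.3
inner = phi(x) @ phi(xp)    # explicit d-dimensional inner product
direct = (1 + x * xp) ** 3  # evaluated directly, no features needed

assert np.isclose(inner, direct)
```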

Kernel

  • A kernel $\kappa({\bf x}, {\bf x}')$ is any function that takes in two arguments from the same space and returns a scalar.
    • Limit the choice of kernel to kernels that are real-valued and symmetric: $\kappa({\bf x}, {\bf x}') = \kappa({\bf x}', {\bf x})\in\mathbb{R}$
  • The “kernel” that we used before - the inner product of the input transformations - is an example of a kernel:

$$ \kappa({\bf x}, {\bf x}')=\phi({\bf x})^T \phi({\bf x}') \tag{8.11} $$

  • Since $\phi({\bf x})$ only appears in the linear regression model via inner products, we do not have to design a $d$-dimensional vector $\phi({\bf x})$ and derive its inner product.
    • Instead, we can just use the inner product directly. This is known as the kernel trick:

If ${\bf x}$ enters the model as $\phi({\bf x})^T\phi({\bf x}')$ only, we can choose a kernel $\kappa({\bf x}, {\bf x}')$ instead of choosing $\phi({\bf x})$.

  • To be clear, we can write Equation 8.7 (linear regression using $\phi({\bf x})$) using the kernel:

$$ \begin{align*} \hat y({\bf x_\star}) &= \underbrace{{\bf y}^T}_{1\times n} \underbrace{({\bf K}({\bf X},{\bf X}) + n\lambda {\bf I})^{-1}}_{n\times n} \underbrace{{\bf K}({\bf X}, {\bf x_\star})}_{n\times 1} \tag{8.12a}\\ \text{where } {\bf K}({\bf X},{\bf X})&=\begin{bmatrix} \kappa({\bf x}_1, {\bf x}_1) & \kappa({\bf x}_1, {\bf x}_2) & \cdots & \kappa({\bf x}_1, {\bf x}_n)\\ \kappa({\bf x}_2, {\bf x}_1) & \kappa({\bf x}_2, {\bf x}_2) & \cdots & \kappa({\bf x}_2, {\bf x}_n)\\ \vdots & & \ddots & \vdots \\ \kappa({\bf x}_n, {\bf x}_1) & \kappa({\bf x}_n, {\bf x}_2) & \cdots & \kappa({\bf x}_n, {\bf x}_n) \end{bmatrix}\tag{8.12b}\\ {\bf K}({\bf X},{\bf x_\star}) &= \begin{bmatrix} \kappa({\bf x}_1, {\bf x_\star}) \\ \kappa({\bf x}_2, {\bf x_\star}) \\ \vdots \\ \kappa({\bf x}_n, {\bf x_\star}) \end{bmatrix}\tag{8.12c} \end{align*} $$

  • These equations describe linear regression with $L^2$ regularisation using a kernel $\kappa({\bf x}, {\bf x}')$.
  • Since $L^2$-regularised linear regression is also called ridge regression, we refer to Equation 8.12 as kernel ridge regression.
  • In principle, we may choose the kernel $\kappa({\bf x}, {\bf x}')$ arbitrarily, just as long as we can compute Equation 8.12a.
    • Requires that the inverse of ${\bf K}({\bf X},{\bf X}) + n \lambda {\bf I}$ exists - guaranteed for positive semidefinite kernels.
  • Positive semidefinite kernels include the squared-exponential kernel (also called the RBF, exponentiated quadratic or Gaussian kernel), shown below in Equation 8.13.
    • In this kernel function, $\ell>0$ is a design choice left to the user.

$$ \kappa({\bf x}, {\bf x}')=\exp\left(-\frac{||{\bf x}-{\bf x}'||_2^2}{2\ell^2}\right) \tag{8.13} $$

  • Another kernel function is the polynomial kernel:

$$ \kappa({\bf x},{\bf x}') = (c + {\bf x}^T {\bf x}')^{d-1} $$

  • And this kernel has the special case of the linear kernel:

$$ \kappa({\bf x},{\bf x}') = c + {\bf x}^T {\bf x}' $$
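All three kernels above can be evaluated directly, without ever forming $\phi({\bf x})$; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def squared_exponential(x, xp, ell=1.0):
    """Squared-exponential (RBF / Gaussian) kernel, Equation 8.13."""
    diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * ell**2))

def polynomial(x, xp, c=1.0, d=4):
    """Polynomial kernel of order d - 1."""
    return (c + np.dot(x, xp)) ** (d - 1)

def linear(x, xp, c=1.0):
    """Linear kernel: the polynomial kernel with d = 2."""
    return c + np.dot(x, xp)
```

Each function is real-valued and symmetric in its two arguments, as required of a kernel.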

  • From Equation 8.12, it may seem as if we have to compute the inverse of ${\bf K}({\bf X},{\bf X}) + n \lambda {\bf I}$ every time we want to make a prediction.

    • Not necessary, since this matrix does not depend on the test point ${\bf x_\star}$.

    • Therefore, we can introduce the $n$-dimensional vector $\hat\alpha$ such that

      $$ \hat\alpha=\begin{bmatrix}\hat\alpha_1\\\hat\alpha_2\\\vdots\\\hat\alpha_n\end{bmatrix}=({\bf K}({\bf X},{\bf X}) + n\lambda {\bf I})^{-1}{\bf y} \tag{8.14a} $$

    • This then allows us to re-write kernel ridge regression as:

      $$ \hat{y}({\bf x_\star}) = \hat\alpha^T {\bf K}({\bf X},{\bf x_\star})\tag{8.14b} $$

    • This means that we don't need to store the $d$-dimensional vector $\hat\theta$, but we do need to store the $n$-dimensional vector $\hat\alpha$ and ${\bf X}$ (since we need to compute ${\bf K}({\bf X},{\bf x_\star})$).

  • The use of kernel ridge regression is summarised as:

Learn Kernel Ridge Regression

Data: Training data $\mathcal{T}=\{{\bf x}_i, y_i\}_{i=1}^{n}$ and kernel $\kappa$
Result: Learned dual parameters $\hat\alpha$

  1. Compute $\hat\alpha$ as per Equation 8.14a

Predict with Kernel Ridge Regression

Data: Learned dual parameters $\hat\alpha$, training inputs ${\bf X}$ and test input ${\bf x_\star}$

  1. Compute $\hat y({\bf x_\star})$ as per Equation 8.14b
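The learn and predict steps above can be sketched in NumPy for 1-D inputs (the squared-exponential kernel and all names here are illustrative choices):

```python
import numpy as np

def kernel(x, xp, ell=0.2):
    """Squared-exponential kernel, Equation 8.13, for scalar inputs."""
    return np.exp(-((x - xp) ** 2) / (2.0 * ell**2))

def learn_krr(X, y, lam):
    """Equation 8.14a: alpha_hat = (K(X, X) + n*lam*I)^{-1} y."""
    n = X.shape[0]
    K = kernel(X[:, None], X[None, :])  # n x n Gram matrix K(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_krr(alpha_hat, X, x_star):
    """Equation 8.14b: y_hat(x_star) = alpha_hat^T K(X, x_star)."""
    return alpha_hat @ kernel(X, x_star)

# Toy 1-D regression problem
rng = np.random.default_rng(2)
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(X.size)

alpha_hat = learn_krr(X, y, lam=1e-3)
y_star = predict_krr(alpha_hat, X, 0.25)  # prediction at x_star = 0.25
```

Note that nothing $d$-dimensional is ever computed: only the $n\times n$ Gram matrix and the $n$-vector ${\bf K}({\bf X},{\bf x_\star})$ appear.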

Support Vector Regression

A version of Kernel Ridge Regression, but with a different loss function

Representer Theorem

  • Interpret Equation 8.14 as a dual formulation of linear regression, where we have dual parameters $\alpha$ instead of primal parameters $\theta$.
    • Here “primal” and “dual” are just two different ways of expressing the same linear regression idea.
  • Comparing Equations 8.14b and 8.5, we have:

$$ \hat y ({\bf x_\star})=\hat\theta^T \phi({\bf x_\star}) = \hat\alpha^T \underbrace{{\bf \Phi}({\bf X}) \phi({\bf x_\star})}_{{\bf K}({\bf X},{\bf x_\star})}\tag{8.15} $$

  • This suggests that:

$$ \hat\theta={\bf\Phi}({\bf X})^T \hat\alpha \tag{8.16} $$

  • This relationship between $\theta$ and $\alpha$ is not specific to kernel ridge regression, but a consequence of a general result called the representer theorem.
    • It holds whenever $\theta$ is learned using almost any loss function together with L2 regularisation.
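For a finite feature vector, Equation 8.16 can be checked numerically by computing the primal and dual ridge solutions separately; a sketch assuming polynomial features:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 15, 4, 0.1
x = rng.uniform(-1, 1, n)
y = rng.standard_normal(n)

Phi = np.vander(x, N=d, increasing=True)  # n x d feature matrix Phi(X)

# Primal: theta_hat = (Phi^T Phi + n*lam*I)^{-1} Phi^T y   (Equation 8.4a)
theta_hat = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(d), Phi.T @ y)

# Dual: alpha_hat = (Phi Phi^T + n*lam*I)^{-1} y           (Equation 8.14a with K = Phi Phi^T)
alpha_hat = np.linalg.solve(Phi @ Phi.T + n * lam * np.eye(n), y)

# Representer theorem, Equation 8.16: theta_hat = Phi(X)^T alpha_hat
assert np.allclose(theta_hat, Phi.T @ alpha_hat)
```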

Support Vector Regression

The same as Kernel Ridge Regression, but using $\epsilon$-insensitive loss.

  • The use of $\epsilon$-insensitive loss causes the elements of $\hat\alpha$ to become sparse (some elements become zero).
  • The training points corresponding to the non-zero elements of $\hat\alpha$ are referred to as **support vectors**.
  • The prediction $\hat y ({\bf x_\star})$ will only depend on the support vectors.
  • The $\epsilon$-insensitive loss is given as:

$$ L(y,\hat y) = \begin{cases}0 & \text{if } |y-\hat y| < \epsilon, \\ |y - \hat y| - \epsilon & \text{otherwise}\end{cases} \tag{8.17} $$

  • The parameter $\epsilon$ is a user choice.
  • Support Vector Regression makes use of the linear regression model:

$$ \hat y({\bf x_\star}) = \theta^T \phi({\bf x_\star})\tag{8.18a} $$

  • Instead of using the least-squares cost function as in Equation 8.4, we now have:

$$ \hat\theta=\arg\min_\theta \frac 1n \sum_{i=1}^{n} \max \{0, | y_i - \underbrace{\theta^T \phi({\bf x}_i)}_{\hat y({\bf x}_i)}| - \epsilon\} + \lambda || \theta ||_2^2 \tag{8.18b} $$

  • Reformulate the primal formulation in Equation 8.18 into a dual formulation (from using $\theta$ to using $\alpha$).
  • Cannot use the closed-form derivation - the dual formulation becomes:

$$ \hat y ({\bf x_\star}) = \hat\alpha^T {\bf K}({\bf X}, {\bf x_\star}) \tag{8.19a} $$

  • Here $\hat\alpha$ is the solution to the optimisation problem:

$$ \hat\alpha = \arg\min_\alpha \frac 12 \alpha^T {\bf K}({\bf X}, {\bf X})\alpha-\alpha^T{\bf y} + \epsilon ||\alpha||_1\tag{8.19b} $$

  • Equation 8.19b is the equivalent of Equation 8.14a for Kernel Ridge Regression, and is a consequence of the representer theorem.
  • A larger value of $\epsilon$ will result in fewer support vectors.
    • The number of support vectors is also affected by $\lambda$, since $\lambda$ influences the shape of $\hat y({\bf x_\star})$.
  • All data is used at training time (for solving Equation 8.19b) but only the support vectors contribute toward the prediction.
  • Equation 8.19b is a constrained optimisation problem (constrained by Equation 8.19c).
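The $\epsilon$-insensitive loss of Equation 8.17 is straightforward to implement; a minimal sketch:

```python
import numpy as np

def eps_insensitive_loss(y, y_hat, eps=0.5):
    """Equation 8.17: zero inside the eps-tube, absolute error minus eps outside."""
    r = np.abs(y - y_hat)
    return np.where(r < eps, 0.0, r - eps)

# Residuals of 0.1 and 0.2 fall inside the tube (zero loss); 2.0 costs 2.0 - 0.5
losses = eps_insensitive_loss(np.array([1.0, 1.2, 3.0]), np.array([1.1, 1.0, 1.0]))
```

Errors smaller than $\epsilon$ cost nothing, which is what produces the sparsity in $\hat\alpha$ and hence the support vectors.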

Support Vector Classification

  • Possible to derive a kernel version of logistic regression with L2 regularisation.
  • Consider the binary classification problem $y\in\{-1,1\}$ and start with the margin formulation of the logistic regression classifier:

$$ \hat y({\bf x_\star}) = \text{sign} \{\theta^T \phi({\bf x_\star})\} \tag{8.32} $$

  • If we were to learn $\theta$ using the logistic loss, we would obtain logistic regression with non-linear feature transformation $\phi({\bf x})$, from which kernel logistic regression would eventually follow.
  • Instead, make use of the hinge loss function:

$$ L({\bf x}, y, \theta) = \max\{0, 1-y \theta^T \phi({\bf x})\} = \begin{cases} 1 - y\theta^T \phi({\bf x}) & \text{if } y\theta^T\phi({\bf x}) < 1 \\ 0 & \text{otherwise} \end{cases}\tag{8.33} $$

  • Analogously to the $\epsilon$-insensitive loss, the motivation for the hinge loss comes from looking at the dual formulation using $\alpha$.
  • First, consider the primal formulation with L2 regularisation:

$$ \hat\theta = \arg\min_\theta \frac 1n \sum_{i=1}^{n} \max\{0, 1-y_i \theta^T \phi({\bf x}_i)\} + \lambda ||\theta||_2^2 \tag{8.34} $$

  • Cannot use the kernel trick directly, as the feature vector does not appear as $\phi({\bf x})^T\phi({\bf x}')$; however, we can derive the following dual formulation:

$$ \begin{align*} \hat\alpha&=\arg\min_\alpha \frac 12 \alpha^T {\bf K}({\bf X},{\bf X})\alpha-\alpha^T {\bf y}\tag{8.35a}\\ \text{subject to } |\alpha_i| &\le \frac{1}{2n\lambda} \text{ and } 0 \le \alpha_i y_i \tag{8.35b}\\ \text{with } \hat y({\bf x_\star}) &= \text{sign}\left(\hat\alpha^T {\bf K}({\bf X}, {\bf x_\star})\right)\tag{8.35c} \end{align*} $$

  • Support Vector Classification can utilise different kernel functions which result in different decision boundaries

    Figure 1: The decision boundaries for support vector classification with linear kernel (left) and squared exponential kernel (right).

  • As a consequence of using hinge loss, support vector classification provides hard classifications.
    • Can use the squared hinge loss or the Huberised squared hinge loss, which allow for a probabilistic interpretation of the margin.
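For comparison with Equation 8.33, the hinge loss can be sketched as (names are illustrative):

```python
import numpy as np

def hinge_loss(y, margin):
    """Equation 8.33, with margin = theta^T phi(x): zero once y * margin >= 1."""
    return np.maximum(0.0, 1.0 - y * margin)

# A confidently correct prediction (margin 2) costs nothing;
# a weakly correct one (0.5) and a wrong one (-1) incur linear cost
losses = hinge_loss(np.array([1.0, 1.0, 1.0]), np.array([2.0, 0.5, -1.0]))
```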