Chapter 9 Summary

Course: Machine Learning
Semester: S1 2023


The Bayesian Idea

  • Thus far, learning a parametric model has amounted to somehow finding a parameter value $\hat\theta$ that best fits the training data.

  • With the Bayesian approach, learning amounts to finding the distribution of the parameter values $\theta$ conditioned on the observed training data $\mathcal{T}$ - $p(\theta|\mathcal{T})$.

  • The prediction is a distribution $p(y_\star|{\bf x}_\star, \mathcal{T})$ instead of a single value.

  • With the Bayesian approach, the parameters of any model are consistently treated as random variables.

  • Learning amounts to computing the distribution of $\theta$ conditioned on the training data, denoted $p(\theta|{\bf y})$ since we omit ${\bf X}$ from the notation.

  • The computation is done using the factorisation of the joint distribution and Bayes' theorem.

  • By the laws of probability, $p({\bf y})$ can be written as:

$$p({\bf y})=\int p({\bf y}|\theta)p(\theta)\,d\theta \tag{9.2}$$

  • Training a parametric model amounts to conditioning $\theta$ on ${\bf y}$ - computing $p(\theta|{\bf y})$.
  • After training, the model can be used to compute predictions for a given test input ${\bf x}_\star$ - a matter of computing the distribution $p(y_\star|{\bf x}_\star)$ rather than a point prediction.

$$p(y_\star|{\bf x}_\star) = \int p(y_\star|\theta)\,p(\theta|{\bf y})\,d\theta \tag{9.3}$$

  • Here $p(y_\star|\theta)$ encodes the distribution of the test output $y_\star$, in which the corresponding input ${\bf x}_\star$ is omitted from the notation.
  • The elements involved in the Bayesian approach are traditionally given the names:
    • $p(\theta)$ - prior
    • $p({\bf y}|\theta)$ - likelihood
    • $p(\theta|{\bf y})$ - posterior
    • $p(y_\star|{\bf y})$ - posterior predictive

Representation of Beliefs

  • The prior represents our beliefs about $\theta$ before any data has been observed.

  • The likelihood $p({\bf y}|\theta)$ defines how the data ${\bf y}$ relates to the parameters $\theta$.

    • Using Bayes' theorem, we update the belief about $\theta$ to the posterior $p(\theta|{\bf y})$, which also takes the observed data ${\bf y}$ into account.
  • These distributions represent the uncertainty about the parameter $\theta$ before and after observing the data ${\bf y}$.

  • The Bayesian approach is less prone to overfitting when compared to the maximum-likelihood based approach.

    • With the maximum likelihood framework, we obtain a single value $\hat\theta$ and use it to make predictions according to $p(y_\star|\hat\theta)$.
    • With the Bayesian approach, we obtain an entire distribution $p(\theta|{\bf y})$ representing the different hypotheses for the values of our model parameters.
  • For small datasets, the uncertainty seen in the posterior $p(\theta|{\bf y})$ represents how much (or little) can be said about $\theta$ from the presumably limited information in ${\bf y}$ under the assumed conditions.

  • The posterior $p(\theta|{\bf y})$ is a combination of the prior belief $p(\theta)$ and the information about $\theta$ carried by ${\bf y}$ through the likelihood.

    • Without a meaningful prior $p(\theta)$, the posterior $p(\theta|{\bf y})$ is not meaningful either.

Bayesian Linear Regression

  • Let ${\bf z}$ denote a $q$-dimensional multivariate Gaussian random vector ${\bf z} = \begin{bmatrix}z_1&z_2&\dots&z_q\end{bmatrix}^T$.
  • The multivariate Gaussian distribution is parameterized by a $q$-dimensional mean vector ${\bf\mu}$ and a $q\times q$ covariance matrix ${\bf\Sigma}$.

$${\bf\mu}=\begin{bmatrix}\mu_1\\\mu_2\\\vdots\\\mu_q\end{bmatrix},\qquad {\bf\Sigma} = \begin{bmatrix} \sigma_{1}^2&\sigma_{12}&\dots&\sigma_{1q}\\ \sigma_{21} & \sigma_{2}^2 & & \sigma_{2q}\\ \vdots & & \ddots & \vdots\\ \sigma_{q1} & \sigma_{q2} & \dots & \sigma_{q}^2 \end{bmatrix}$$

  • The covariance matrix is a real-valued positive semidefinite matrix - a symmetric matrix with nonnegative eigenvalues.
  • The covariance matrix is positive definite if all eigenvalues are positive. As a shorthand, we write ${\bf z} \sim \mathcal{N}({\bf\mu}, {\bf\Sigma})$ or $p({\bf z}) = \mathcal{N}({\bf z}; {\bf\mu}, {\bf\Sigma})$.
  • The expected value of ${\bf z}$ is $\mathbb{E}[{\bf z}] = {\bf\mu}$ and the variance of $z_1$ is $\text{var}(z_1) = \mathbb{E}[(z_1 - \mathbb{E}[z_1])^2] = \sigma_1^2$.
  • The covariance between $z_1$ and $z_2$ is $\text{cov}(z_1, z_2) = \mathbb{E}[(z_1 - \mathbb{E}[z_1])(z_2 - \mathbb{E}[z_2])] = \sigma_{12} = \sigma_{21}$.

From Chapter 3, the linear regression model is given by:

$$y = f({\bf x}) + \varepsilon, \qquad f({\bf x}) = \theta^T{\bf x}, \qquad \varepsilon \sim \mathcal{N}(0,\sigma^2) \tag{9.6}$$

$$p(y|\theta) = \mathcal{N}(y;\theta^T{\bf x},\sigma^2) \tag{9.7}$$

  • This expression is for one output point $y$; for the vector ${\bf y}$ of all training outputs, we have:

$$p({\bf y}|\theta)=\prod_{i=1}^n p(y_i|\theta)=\prod_{i=1}^n\mathcal{N}(y_i;\theta^T{\bf x}_i,\sigma^2) =\mathcal{N}({\bf y};{\bf X}\theta, \sigma^2{\bf I}) \tag{9.8}$$

  • In the last step, we use the fact that an $n$-dimensional Gaussian random vector with a diagonal covariance matrix is equivalent to $n$ independent scalar Gaussian random variables.
  • With the Bayesian approach, we also need a prior $p(\theta)$ for the unknown parameters $\theta$.
  • In Bayesian linear regression, the prior distribution is most often chosen as a Gaussian with mean ${\bf\mu}_0$ and covariance ${\bf\Sigma}_0$ (for example, ${\bf\Sigma}_0=\sigma_0^2{\bf I}$):

$$p(\theta)=\mathcal{N}(\theta;{\bf\mu}_0,{\bf\Sigma}_0) \tag{9.9}$$

  • This choice is motivated by the fact that it simplifies the computations for linear regression.
  • We now need to compute the posterior distribution $p(\theta|{\bf y})$. With ${\bf\Sigma}_0=\sigma_0^2{\bf I}$, it is:

$$\begin{align*} p(\theta|{\bf y})&=\mathcal{N}(\theta;{\bf\mu}_n, {\bf\Sigma}_n) \tag{9.10a} \\ {\bf\mu}_n&={\bf\Sigma}_n \left(\frac{1}{\sigma_0^2} {\bf\mu}_0 + \frac{1}{\sigma^2} {\bf X}^T {\bf y}\right) \tag{9.10b} \\ {\bf\Sigma}_n &= \left(\frac{1}{\sigma_0^2} {\bf I} + \frac{1}{\sigma^2} {\bf X}^T {\bf X}\right)^{-1} \tag{9.10c} \end{align*}$$

  • From Equation 9.10, we can also derive the posterior for $f({\bf x}_\star)$:

$$\begin{align*} p(f({\bf x}_\star) \mid {\bf y}) &= \mathcal{N}(f({\bf x}_\star); m_\star, s_\star) \tag{9.11a}\\ m_\star&={\bf x}_\star^T {\bf\mu}_n \tag{9.11b}\\ s_\star&={\bf x}_\star^T {\bf\Sigma}_n {\bf x}_\star \tag{9.11c} \end{align*}$$

  • We can also compute the posterior predictive for $y_\star$:

$$p(y_\star \mid {\bf y}) = \mathcal{N}(y_\star; m_\star, s_\star + \sigma^2) \tag{9.11d}$$
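As a concrete illustration, Equations 9.10 and 9.11 can be evaluated directly with NumPy. This is a sketch under assumed values (a 1-D input with an intercept, prior $\mathcal{N}({\bf 0}, \sigma_0^2{\bf I})$, and invented noise levels), not code from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 1-D inputs with an intercept, x_i -> [1, x_i]
n, sigma2, sigma02 = 20, 0.25, 1.0   # noise / prior variances (invented)
x = rng.uniform(-3, 3, n)
X = np.column_stack([np.ones(n), x])
theta_true = np.array([1.0, 2.0])    # "true" parameters for the simulation
y = X @ theta_true + rng.normal(0.0, np.sqrt(sigma2), n)

# Posterior over theta (Eq. 9.10) with prior mean mu0 = 0
mu0 = np.zeros(2)
Sigma_n = np.linalg.inv(np.eye(2) / sigma02 + X.T @ X / sigma2)
mu_n = Sigma_n @ (mu0 / sigma02 + X.T @ y / sigma2)

# Posterior of f(x_star) (Eq. 9.11) and predictive of y_star (Eq. 9.11d)
x_star = np.array([1.0, 0.5])        # test input 0.5, with intercept term
m_star = x_star @ mu_n
s_star = x_star @ Sigma_n @ x_star   # variance of f(x_star)
y_star_var = s_star + sigma2         # variance of y_star adds the noise
```

With plenty of data, $m_\star$ lands close to the value $1 + 2\cdot 0.5 = 2$ implied by the simulated parameters, and $s_\star$ shrinks as $n$ grows.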

Connection to Regularised Linear Regression

  • The main feature of the Bayesian approach is that it provides a full distribution $p(\theta|{\bf y})$ for the parameters $\theta$, rather than a single point estimate.
  • The MAP estimate (the value of $\theta$ that maximises the posterior) and the $L^2$-regularised estimate of $\theta$ are identical for some value of the regularisation parameter $\lambda$.
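To see why, note that the MAP estimate maximises the log posterior; with the Gaussian likelihood of Equation 9.8 and the Gaussian prior of Equation 9.9 in the special case ${\bf\mu}_0={\bf 0}$, ${\bf\Sigma}_0=\sigma_0^2{\bf I}$:

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \left[\log p({\bf y}|\theta) + \log p(\theta)\right] = \arg\min_\theta \left[\frac{1}{2\sigma^2}\|{\bf y}-{\bf X}\theta\|_2^2 + \frac{1}{2\sigma_0^2}\|\theta\|_2^2\right]$$

which is the $L^2$-regularised least-squares cost with $\lambda = \sigma^2/\sigma_0^2$.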

The Gaussian Process

  • Instead of considering the parameters $\theta$ as random variables, we can consider the function $f({\bf x})$ as a random variable and compute the posterior distribution $p(f({\bf x})|{\bf y})$.

  • The Gaussian Process is a type of stochastic process; a generalisation of a random variable.

  • Extend the concept of a stochastic process to random functions with arbitrary inputs $\lbrace f({\bf x}) : {\bf x} \in \mathcal{X}\rbrace$, where $\mathcal{X}$ denotes the (possibly high-dimensional) input space.

    • With this, the function values $f({\bf x})$ and $f({\bf x}')$ for inputs ${\bf x}, {\bf x}'$ are dependent.
  • If we expect the function to be smooth (to vary slowly), then the function values $f({\bf x})$ and $f({\bf x}')$ should be highly correlated when ${\bf x}$ and ${\bf x}'$ are close.

  • This generalisation allows us to use random functions as priors for unknown functions in a Bayesian setting.

  • Start by making the simplifying assumption that ${\bf x}$ is discrete and can only take $q$ different values, ${\bf x}_1, {\bf x}_2, \dots, {\bf x}_q$.

    • The function is then completely characterised by the vector ${\bf f} = \begin{bmatrix}f_1&\cdots&f_q\end{bmatrix}^T=\begin{bmatrix}f({\bf x}_1)& \dots& f({\bf x}_q)\end{bmatrix}^T$.
    • We can then model $f({\bf x})$ as a random function by assigning a joint probability distribution to this vector ${\bf f}$.
    • In the Gaussian process model, this distribution is the multivariate Gaussian distribution, with mean vector ${\bf\mu}$ and covariance matrix ${\bf\Sigma}$:

$$p({\bf f}) = \mathcal{N}({\bf f};{\bf\mu},{\bf\Sigma}) \tag{9.15}$$

  • Let us partition ${\bf f}$ into two vectors ${\bf f}_1$ and ${\bf f}_2$ such that ${\bf f} = \begin{bmatrix} {\bf f}_1^T & {\bf f}_2^T \end{bmatrix}^T$, and do the same for ${\bf\mu}$ and ${\bf\Sigma}$:

$$p\left(\begin{bmatrix}{\bf f}_1 \\ {\bf f}_2 \end{bmatrix}\right) =\mathcal{N} \left( \begin{bmatrix}{\bf f}_1 \\ {\bf f}_2 \end{bmatrix}; \begin{bmatrix}{\bf\mu}_1 \\ {\bf\mu}_2 \end{bmatrix}, \begin{bmatrix}{\bf\Sigma}_{11} & {\bf\Sigma}_{12} \\ {\bf\Sigma}_{21} & {\bf\Sigma}_{22} \end{bmatrix} \right) \tag{9.16}$$

  • If some elements of ${\bf f}$, say ${\bf f}_1$, are observed, then the conditional distribution of the remaining elements ${\bf f}_2$ given ${\bf f}_1$ is:

$$p({\bf f}_2|{\bf f}_1) = \mathcal{N}\left({\bf f}_2;\,{\bf\mu}_2 + {\bf\Sigma}_{21} {\bf\Sigma}_{11}^{-1}({\bf f}_1-{\bf\mu}_1),\; {\bf\Sigma}_{22} - {\bf\Sigma}_{21}{\bf\Sigma}_{11}^{-1}{\bf\Sigma}_{12}\right) \tag{9.17}$$

  • The conditional distribution is nothing but another Gaussian distribution (with closed-form expressions for mean and covariance).
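A minimal numerical sketch of Equation 9.17 for two scalar components, with invented values for the mean and covariance:

```python
import numpy as np

# Assumed joint Gaussian over (f1, f2)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

f1_obs = 1.0  # hypothetical observed value of f1

# Eq. 9.17 with scalar blocks: Sigma_11 = Sigma[0, 0], etc.
mu_cond = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (f1_obs - mu[0])
var_cond = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]

# mu_cond = 0.8: the mean of f2 moves toward the observation
# var_cond = 0.36: the variance shrinks below the prior variance 1.0
```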

    Figure 1 - Gaussian distribution for random variables f1 and f2 before and after sampling a value.

  • In the figure, ${\bf f}_1$ is a scalar $f_1$ and ${\bf f}_2$ is a scalar $f_2$.

  • The multivariate Gaussian distribution in the right subplot is conditioned on an observation of $f_1$.

    • Both of these Gaussian distributions have the same mean vector and covariance matrix.
  • Since $f_1$ and $f_2$ are correlated according to the prior, the marginal distribution of $f_2$ is also affected by this observation.

    Figure 3 - Gaussian distribution for random variables f1 and f2 before and after sampling a value.

  • In a similar fashion to the figure above, we can plot a six-dimensional Gaussian distribution.

  • Assume a positive prior correlation between all elements $f_i$ and $f_j$, which decays with the distance between the corresponding inputs $x_i$ and $x_j$.

  • We condition the six-dimensional distribution underlying the figure on an observation of, for example, $f_4$, producing the following plot:

    Figure 4 - A six-dimensional Gaussian distribution conditioned on an observation of $f_4$. Note that only the marginals are plotted in both subplots.

  • The extension of the Gaussian distribution (defined on a finite set) to the Gaussian process (defined on a continuous space) is achieved by replacing the discrete index set $\lbrace 1, 2, 3, 4, 5, 6\rbrace$ in the figure above by a variable ${\bf x}$ taking values in a continuous space, for example the real line.

  • We then replace the random variables $f_1, f_2, \dots, f_6$ with a random function (that is, a stochastic process) $f$, which can be evaluated at any ${\bf x}$ to obtain $f({\bf x})$.

  • In the multivariate Gaussian distribution, ${\bf\mu}$ is a vector with $q$ components, and ${\bf\Sigma}$ is a $q\times q$ matrix.

    • Instead of having a separate parameter for each element of the mean vector and covariance matrix, in the Gaussian process we replace ${\bf\mu}$ by a mean function $\mu({\bf x})$ into which we can insert any ${\bf x}$.
    • Likewise, the covariance matrix ${\bf\Sigma}$ is replaced by a covariance function $\kappa({\bf x},{\bf x}')$ into which we can insert any pair ${\bf x}$ and ${\bf x}'$.
  • From this, we can define the Gaussian process: for any arbitrary finite set of points $\lbrace {\bf x}_1, \dots, {\bf x}_n\rbrace$, it holds that:

$$p\left( \begin{bmatrix} f({\bf x}_1) \\ \vdots \\ f({\bf x}_n) \end{bmatrix}\right) = \mathcal{N}\left( \begin{bmatrix} f({\bf x}_1) \\ \vdots \\ f({\bf x}_n) \end{bmatrix}; \begin{bmatrix} \mu({\bf x}_1) \\ \vdots \\ \mu({\bf x}_n) \end{bmatrix}, \begin{bmatrix} \kappa({\bf x}_1, {\bf x}_1) & \cdots & \kappa({\bf x}_1, {\bf x}_n) \\ \vdots & & \vdots \\ \kappa({\bf x}_n, {\bf x}_1) & \cdots & \kappa({\bf x}_n, {\bf x}_n) \end{bmatrix} \right) \tag{9.18}$$

  • With a Gaussian process $f$ and any choice of $\lbrace {\bf x}_1, \dots, {\bf x}_n\rbrace$, the vector of function values $\begin{bmatrix} f({\bf x}_1) & \dots & f({\bf x}_n) \end{bmatrix}^T$ has a multivariate Gaussian distribution.
  • Since $\lbrace {\bf x}_1, \dots, {\bf x}_n\rbrace$ can be chosen arbitrarily from the continuous space on which ${\bf x}$ lives, the Gaussian process defines a distribution for all points in that space.
  • For this definition to make sense, $\kappa({\bf x},{\bf x}')$ has to be such that a positive semidefinite covariance matrix is obtained for any choice of $\lbrace {\bf x}_1, \dots, {\bf x}_n\rbrace$.
  • The following notation states that the function $f({\bf x})$ is distributed according to a Gaussian process with mean function $\mu({\bf x})$ and covariance function $\kappa({\bf x},{\bf x}')$:

$$f\sim\mathcal{GP}(\mu({\bf x}), \kappa({\bf x},{\bf x}')) \tag{9.19}$$
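To make Equations 9.18 and 9.19 concrete, we can evaluate a GP prior on a grid and draw sample functions from the resulting multivariate Gaussian. The squared-exponential covariance function used here is an assumed choice (the text leaves $\kappa$ unspecified at this point):

```python
import numpy as np

def kappa(a, b, ell=1.0):
    # Squared-exponential covariance: correlation decays with distance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# Evaluate the GP prior (Eq. 9.18) on a grid, with mu(x) = 0
xs = np.linspace(0.0, 5.0, 50)
mu = np.zeros_like(xs)
K = kappa(xs, xs)

# Draw three sample functions; the jitter keeps K numerically
# positive definite, as required for a valid covariance matrix
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mu, K + 1e-9 * np.eye(xs.size), size=3)
```

Each row of `samples` is one realisation of the random function on the grid; with a smaller length scale `ell`, the samples wiggle faster.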

  • We use the symbol $\kappa$ for covariance functions, as for kernels in the previous chapter - we will soon see that applying the Bayesian approach to kernel ridge regression results in a Gaussian process whose covariance function is the kernel function.
  • We can also condition the Gaussian process on observed data points.
  • We stack the observed inputs in ${\bf X}$ and let $f({\bf X})$ denote the vector of observed outputs.
  • Use the notations ${\bf K}({\bf X},{\bf X})$ and ${\bf K}({\bf X},{\bf x}_\star)$ to write the joint distribution between the observed values $f({\bf X})$ and the test value $f({\bf x}_\star)$ as:

$$p\left( \begin{pmatrix} f({\bf x}_\star) \\ f({\bf X}) \end{pmatrix} \right) = \mathcal{N}\left( \begin{pmatrix} f({\bf x}_\star) \\ f({\bf X}) \end{pmatrix}; \begin{pmatrix} \mu({\bf x}_\star) \\ \mu({\bf X}) \end{pmatrix}, \begin{pmatrix} \kappa({\bf x}_\star, {\bf x}_\star) & {\bf K}({\bf X}, {\bf x}_\star)^T \\ {\bf K}({\bf X}, {\bf x}_\star) & {\bf K}({\bf X}, {\bf X}) \end{pmatrix} \right) \tag{9.20}$$

  • As we have observed $f({\bf X})$, we use the expressions for the conditional Gaussian distribution to write the distribution for $f({\bf x}_\star)$ conditioned on the observations $f({\bf X})$:

$$\begin{align*} p(f({\bf x}_\star) \mid f({\bf X})) = \mathcal{N}\big(f({\bf x}_\star);\; &\mu({\bf x}_\star) + {\bf K}({\bf X}, {\bf x}_\star)^T{\bf K}({\bf X},{\bf X})^{-1} (f({\bf X}) -\mu({\bf X})), \\ &\kappa({\bf x}_\star, {\bf x}_\star) - {\bf K}({\bf X}, {\bf x}_\star)^T{\bf K}({\bf X},{\bf X})^{-1} {\bf K}({\bf X}, {\bf x}_\star)\big) \tag{9.21} \end{align*}$$

  • This produces another Gaussian distribution for any test input ${\bf x}_\star$.

  • We have so far introduced the idea of the Gaussian process abstractly.
  • A white Gaussian process has a white covariance function:

$$\kappa({\bf x}, {\bf x}') = \mathbb{I}\lbrace {\bf x} = {\bf x}'\rbrace = \begin{cases}1&\text{if }{\bf x}={\bf x}'\\0&\text{otherwise}\end{cases} \tag{9.22}$$

  • This implies that $f({\bf x})$ is uncorrelated with $f({\bf x}')$ for any ${\bf x}\neq{\bf x}'$.

Extending Kernel Ridge Regression into a Gaussian Process

  • We can apply the kernel trick from Section 8.2 to the Bayesian linear regression of Equation 9.11.

  • This will essentially lead us back to Equation 9.21, with the kernel as the covariance function $\kappa({\bf x}, {\bf x}')$ and the mean function $\mu({\bf x})=0$.

  • We repeat the posterior predictive for Bayesian linear regression (Equation 9.11), but with two changes:

    • Assume that the prior mean and covariance for $\theta$ are ${\bf\mu}_0={\bf 0}$ and ${\bf\Sigma}_0={\bf I}$.
    • This is not strictly needed, but it simplifies the expressions.
    • We introduce non-linear feature transformations $\phi({\bf x})$ of the input variable ${\bf x}$ in the linear regression model.
    • Replace ${\bf X}$ with ${\bf\Phi}({\bf X})$.
  • Through this, we get:

$$\begin{align*} p(f({\bf x}_\star) \mid {\bf y}) &= \mathcal{N}(f({\bf x}_\star); m_\star, s_\star) \tag{9.23a} \\ m_\star&=\phi({\bf x}_\star)^T \left(\sigma^2 {\bf I} + {\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X})\right)^{-1} {\bf\Phi}({\bf X})^T {\bf y} \tag{9.23b}\\ s_\star &= \phi({\bf x}_\star)^T \left({\bf I} + \frac{1}{\sigma^2} {\bf\Phi}({\bf X})^T {\bf\Phi}({\bf X})\right)^{-1} \phi({\bf x}_\star) \tag{9.23c} \end{align*}$$

  • We use the push-through matrix identity once again to re-write $m_\star$, with the aim of having $\phi({\bf x})$ appear only through inner products.
  • The push-through matrix identity says that ${\bf A}({\bf A}^T{\bf A}+{\bf I})^{-1} = ({\bf A}{\bf A}^T + {\bf I})^{-1}{\bf A}$ holds for any matrix ${\bf A}$.

$$m_\star = \phi({\bf x}_\star)^T {\bf\Phi}({\bf X})^T \left(\sigma^2 {\bf I} + {\bf\Phi}({\bf X}){\bf\Phi}({\bf X})^T\right)^{-1} {\bf y} \tag{9.24a}$$

  • We use the matrix inversion lemma to re-write $s_\star$ in a similar way.
  • The matrix inversion lemma states that $({\bf I}+{\bf U}{\bf V})^{-1} = {\bf I} - {\bf U}({\bf I}+{\bf V}{\bf U})^{-1}{\bf V}$ holds for matrices with compatible dimensions.

$$s_\star = \phi({\bf x}_\star)^T\phi({\bf x}_\star) - \phi({\bf x}_\star)^T {\bf\Phi}({\bf X})^T \left(\sigma^2 {\bf I} + {\bf\Phi}({\bf X}){\bf\Phi}({\bf X})^T\right)^{-1} {\bf\Phi}({\bf X})\phi({\bf x}_\star) \tag{9.24b}$$
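Both identities are easy to sanity-check numerically; a sketch with random matrices of compatible (invented) sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
I4, I3 = np.eye(4), np.eye(3)

# Push-through identity: A (A^T A + I)^{-1} = (A A^T + I)^{-1} A
lhs1 = A @ np.linalg.inv(A.T @ A + I3)
rhs1 = np.linalg.inv(A @ A.T + I4) @ A

# Matrix inversion lemma: (I + U V)^{-1} = I - U (I + V U)^{-1} V
U = rng.normal(size=(4, 3))
V = rng.normal(size=(3, 4))
lhs2 = np.linalg.inv(I4 + U @ V)
rhs2 = I4 - U @ np.linalg.inv(I3 + V @ U) @ V
```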

  • We now apply the kernel trick and replace all instances of basis function inner products with the kernel.

$$\begin{align*} m_\star &= {\bf K}({\bf X}, {\bf x}_\star)^T \left(\sigma^2 {\bf I} + {\bf K}({\bf X},{\bf X})\right)^{-1} {\bf y} \tag{9.25a}\\ s_\star &= \kappa({\bf x}_\star, {\bf x}_\star) - {\bf K}({\bf X}, {\bf x}_\star)^T \left(\sigma^2 {\bf I} + {\bf K}({\bf X},{\bf X})\right)^{-1} {\bf K}({\bf X}, {\bf x}_\star) \tag{9.25b} \end{align*}$$

  • The posterior predictive defined in Equations 9.23a and 9.25 is the Gaussian process model - it is identical to Equation 9.21 if $\mu({\bf x})=0$ and $\sigma^2=0$.
    • The reason for $\mu({\bf x})=0$ is that we started the derivation with ${\bf\mu}_0={\bf 0}$.
    • When we derived Equation 9.21, we assumed that we observed $f({\bf X})$ rather than ${\bf y}=f({\bf X})+{\bf\varepsilon}$, which is why $\sigma^2=0$.
  • The Gaussian process is thus a kernel version of Bayesian linear regression, much like kernel ridge regression is a kernel version of $L^2$-regularised linear regression.
  • The fact that the kernel plays the role of a covariance function in the Gaussian process gives another interpretation of the kernel - it determines how strong the correlation between $f({\bf x})$ and $f({\bf x}_\star)$ is assumed to be.
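Putting Equation 9.25 to work: a sketch of GP regression on simulated 1-D data, with an assumed squared-exponential kernel and an invented noise level:

```python
import numpy as np

def kappa(a, b, ell=1.0):
    # Squared-exponential kernel (an assumed choice; any valid kernel works)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(3)
X = np.linspace(0.0, 5.0, 15)             # training inputs
sigma2 = 0.1                              # assumed noise variance
y = np.sin(X) + rng.normal(0.0, np.sqrt(sigma2), X.size)

x_star = np.array([2.5])                  # single test input
K_XX = kappa(X, X)                        # K(X, X)
K_Xs = kappa(X, x_star)                   # K(X, x_star), shape (15, 1)

# Eq. 9.25: posterior predictive mean and variance of f(x_star)
Ainv = np.linalg.inv(sigma2 * np.eye(X.size) + K_XX)
m_star = (K_Xs.T @ Ainv @ y).item()
s_star = (kappa(x_star, x_star) - K_Xs.T @ Ainv @ K_Xs).item()
```

Near the training data, $m_\star$ tracks the underlying function and $s_\star$ is small; far from the data, $s_\star$ grows back toward the prior variance $\kappa({\bf x}_\star,{\bf x}_\star)$.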

A Non-Parametric Distribution over Functions

  • We use the Gaussian process for making predictions, that is, computing the posterior predictive $p(f({\bf x}_\star)|{\bf y})$.
  • However, unlike most methods, which only deliver a point prediction $\hat y({\bf x}_\star)$, the posterior predictive is a distribution.
    • Since we can compute the posterior predictive for any test point, the Gaussian process defines a distribution over functions.
  • If we only consider the mean $m_\star$ of the posterior predictive, we recover kernel ridge regression.
    • To take full advantage of the Bayesian perspective, we also have to consider the posterior predictive variance $s_\star$.
    • With most kernels, the predictive variance is smaller if there is a training point nearby, and larger if the closest training point is distant.
    • With this, the predictive variance provides a quantification of the uncertainty in the prediction.

Drawing Samples from a Gaussian Process

  • When computing a prediction of $f({\bf x}_\star)$, the posterior predictive captures all information the Gaussian process has about $f({\bf x}_\star)$.
    • If we are also interested in $f({\bf x}_\star + \delta)$, the Gaussian process contains more information than is present in the two posterior predictive distributions $p(f({\bf x}_\star)|{\bf y})$ and $p(f({\bf x}_\star+\delta)|{\bf y})$ separately.
  • The Gaussian process also contains information about the correlation between the function values $f({\bf x}_\star)$ and $f({\bf x}_\star+\delta)$ - the pitfall is that $p(f({\bf x}_\star)|{\bf y})$ and $p(f({\bf x}_\star+\delta)|{\bf y})$ are only the marginal distributions of the joint distribution $p(f({\bf x}_\star), f({\bf x}_\star+\delta)|{\bf y})$.
  • A useful alternative can therefore be to visualise the Gaussian process posterior by sampling entire functions from it.
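A sketch of that idea, again with an assumed squared-exponential kernel and invented data: instead of the marginal mean and variance at each grid point, we form the joint posterior over the whole grid and draw correlated sample functions from it.

```python
import numpy as np

def kappa(a, b, ell=1.0):
    # Squared-exponential kernel (assumed choice)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(4)
X = np.array([0.5, 1.5, 3.0, 4.0])        # training inputs (invented)
sigma2 = 0.05                             # assumed noise variance
y = np.cos(X) + rng.normal(0.0, np.sqrt(sigma2), X.size)

xs = np.linspace(0.0, 5.0, 60)            # grid of test inputs

# Same algebra as Eq. 9.25, but for a vector of test points, so the
# posterior covariance is a full matrix, not just marginal variances
Ainv = np.linalg.inv(kappa(X, X) + sigma2 * np.eye(X.size))
K_Xs = kappa(X, xs)
mean = K_Xs.T @ Ainv @ y
cov = kappa(xs, xs) - K_Xs.T @ Ainv @ K_Xs

# Each row is one plausible function under the posterior, drawn with
# the correct correlations between nearby grid points
paths = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(xs.size), size=5)
```

The sample paths pass close to the training points and fan out in between, which is exactly the joint information that the marginal predictive distributions alone cannot show.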

There is some further content on the practical aspects of the Gaussian process which I have omitted.