COMP4702 Lecture 11

CourseMachine Learning
SemesterS1 2023

COMP4702 Lecture 11

Chapter 10: Generative Models

  • Previous chapters have been mostly about supervised learning, which produces models of $p(y|{\bf x})$.
  • Generative models are models of $p({\bf x}, y)$, parameterised by $\theta$.
    • Dropping the $y$ term would make this unsupervised learning.
  • Generative models are very similar to probabilistic models and computer simulation models - they all draw samples from some probability distribution.
  • So we can use probability density estimation and think of it as a foundation for both unsupervised learning and generative models.
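To make the "draw from some probability distribution" idea concrete, here is a minimal Python sketch of sampling from a hypothetical two-component 1-D Gaussian mixture (the weights, means, and standard deviations are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D mixture of two Gaussians: weights, means, standard deviations
weights = np.array([0.3, 0.7])   # must be non-negative and sum to 1
means = np.array([-2.0, 3.0])
stds = np.array([1.0, 0.5])

def sample_mixture(n):
    """Draw n samples: pick a component i with probability weights[i],
    then draw x from that component's Gaussian."""
    components = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[components], stds[components])

samples = sample_mixture(10_000)
# The sample mean approaches the mixture mean: 0.3*(-2) + 0.7*3 = 1.5
print(samples.mean())
```

This two-stage procedure (pick a component, then sample from it) is exactly what the mixture densities introduced below describe.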

Gaussian Mixture Models and Discriminant Analysis

Mixture Densities

  • A mixture model is a weighted sum of component densities - in this case, Gaussian distributions:

$$p({\bf x})=\sum_{i=1}^{k} p({\bf x}|G_i)\, P(G_i)$$

  • Where $G_i$ is the $i$th component (aka group or cluster assumed to be in the data).
    • $P(G_i)$ is the prior/mixture coefficient or weight of the $i$th component, and $p({\bf x} | G_i)$ is the $i$th component PDF.
    • Note that $0 \le P(G_i)\le 1 \;\forall i$ and $\sum_{i=1}^{k} P(G_i) = 1$.
  • We will look at Gaussian Mixture Models (GMMs), so:

$$p({\bf x} | G_i) = \mathcal{N}({\bf x} \mid {\bf \mu}_i, \Sigma_i)$$

  • For a single variable, we specify a mean and standard deviation.
  • If multivariate, we specify a mean vector and covariance matrix.
  • The parameters of our GMM are $\mu_i, \Sigma_i, P(G_i)$ for $i=1,\cdots,k$. So the LHS of the above equation is more properly written as:

$$p({\bf x}|\theta)=p({\bf x} \mid \lbrace\mu_i, \Sigma_i, P(G_i)\rbrace_{i=1}^{k})$$
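Evaluating this density is just the weighted sum defined above. A small Python sketch, with a hand-rolled multivariate Gaussian PDF and hypothetical parameter values for a two-component GMM in 2-D:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x | mean, cov) — a minimal implementation."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_density(x, priors, means, covs):
    """p(x | theta) = sum_i P(G_i) * N(x | mu_i, Sigma_i)."""
    return sum(p * gaussian_pdf(x, m, c) for p, m, c in zip(priors, means, covs))

# Hypothetical 2-component GMM in two dimensions
priors = [0.4, 0.6]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

print(gmm_density(np.zeros(2), priors, means, covs))
```

At the origin the second component contributes almost nothing, so the density is close to $0.4 \cdot \mathcal{N}({\bf 0} \mid {\bf 0}, I)$.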

![](/images/notes/COMP4702/gaussian-sum.png)


Figure 1 - A Gaussian mixture model with 3 Gaussian distributions in a two-dimensional space.
  • The left plot of Figure 1 shows that we can write a GMM as the weighted sum of 3 (in this case) Gaussian distributions.
  • These three smooth functions add up to create a single smooth function, as shown in the center plot of Figure 1.
  • This GMM is plotted in three dimensions in the right plot of Figure 1.
  • What happens next is really all about whether or not you know a priori that your data comes from some groups/categories/classes, and whether you have those “labels” for your data.

Figure 2 - Fitting Gaussian distributions to labelled data in a two-dimensional space.
  • The above example assumes that you know the labels $\in \lbrace\text{red, blue}\rbrace$, shown on the left.
  • With this knowledge, you can separate the points out and fit a Gaussian distribution to each class - shown by the contour lines in the plot to the right.
  • The weight of each component is given by the frequency of each class in the data.
  • Can use Maximum Likelihood Estimation (MLE) to estimate the parameters of each Gaussian distribution.
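For labelled data the MLE has a closed form: each class's weight is its relative frequency, and its mean and covariance are the sample mean and (biased, $1/n$) sample covariance of that class's points. A sketch with synthetic labelled data (the class means and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical labelled 2-D data: two classes with different means and sizes
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([4, 4], 1.0, (150, 2))])
y = np.array([0] * 50 + [1] * 150)

# MLE for each class: prior = class frequency, mean = sample mean,
# covariance = sample covariance of the class's points
params = {}
for m in np.unique(y):
    Xm = X[y == m]
    params[m] = {
        "prior": len(Xm) / len(X),
        "mean": Xm.mean(axis=0),
        "cov": np.cov(Xm, rowvar=False, bias=True),  # MLE uses 1/n, not 1/(n-1)
    }

print(params[0]["prior"], params[1]["prior"])  # 0.25 0.75
```

The priors here simply reflect that class 1 has three times as many samples as class 0.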

Predicting Output Labels for New Inputs: Discriminant Analysis

  • This section shows that you can use a GMM to do classification, because you can use your model of $p({\bf x},y)$ to get $p(y|{\bf x})$.

    • ${\bf x}$ being the coordinates and $y$ being the class labels.
  • This is a nice illustration for multi-class classification, but you can just use a single Gaussian distribution per class for binary classification and you’d be close to logistic regression.

  • However, there are a few key differences:

    • Priors: not present in logistic regression - it doesn’t explicitly encode the idea that one class has more samples than the other.

    • Different parametrisations of the covariance matrices give rise to Linear Discriminant Analysis (LDA) or Quadratic Discriminant Analysis (QDA).

      • LDA arises when all classes share a single covariance matrix; QDA arises when each class has its own covariance matrix.

      Figure 3 - Difference in decision boundaries generated by LDAs and QDAs.

  • The figure above shows the difference between the decision boundaries generated by Linear Discriminant Analysis (LDA) vs Quadratic Discriminant Analysis (QDA).

    • Observe that in the figure to the left (LDA), the decision boundaries are linear whereas the decision boundaries in the figure to the right (QDA) are quadratic.
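The classification rule behind both LDA and QDA is Bayes' rule applied to the class-conditional Gaussians. A Python sketch (the `posterior` helper and the parameter values are illustrative, not from the lecture); with a shared covariance matrix, as here, the resulting boundary is linear:

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log of the multivariate normal density N(x | mean, cov)."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def posterior(x, priors, means, covs):
    """p(y = m | x) via Bayes' rule from the class-conditional Gaussians."""
    log_joint = np.array([np.log(p) + log_gaussian(x, mu, cov)
                          for p, mu, cov in zip(priors, means, covs)])
    log_joint -= log_joint.max()          # subtract max for numerical stability
    probs = np.exp(log_joint)
    return probs / probs.sum()

# Hypothetical two classes with a shared covariance matrix (the LDA setting)
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.eye(2), np.eye(2)]

print(posterior(np.array([2.0, 0.0]), priors, means, covs))  # midpoint -> [0.5 0.5]
```

Giving each class its own covariance matrix in `covs` would turn this into the QDA setting, with a quadratic boundary.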

Semi-Supervised Learning of the Gaussian Mixture Model

  • This is the setting where we have some data that is labelled, but also (probably a lot more) that is unlabelled.
    • We would like to make use of this unlabelled data as well as the labelled data.
  • We would like to maximise the likelihood, but we can’t do this directly with the unlabelled data included, as shown in Equation 10.9:

$$\hat\theta=\arg\max_\theta\ln p(\underbrace{\lbrace\lbrace{\bf x}_i, y_i\rbrace_{i=1}^{n_l}, \lbrace{\bf x}_i\rbrace_{i=n_l+1}^{n}\rbrace}_{\tau} \mid \theta)$$

  • This problem has no closed-form solution. However, we can take the following approach:
    1. Learn the GMM from the labelled data.
    2. Use the GMM to predict $p(y=m \mid {\bf x}_i,\hat\theta)$ as a proxy for the labels of the unlabelled data.
    3. Update the GMM’s parameters, including the data with predicted labels.
  • This concept works and isn’t just some ad hoc idea - it’s a version of the Expectation Maximisation (EM) algorithm.
  • The equations relevant to this procedure are given as:

$$\begin{align*}
w_i(m)&=\begin{cases}
p(y=m|{\bf x}_i, \hat\theta) & \text{if } y_i \text{ is missing}\\
1 & \text{if } y_i = m\\
0 & \text{otherwise}
\end{cases} \tag{10.10a}\\
\hat\pi_m &= \frac{1}{n} \sum_{i=1}^{n} w_i(m) \tag{10.10b}\\
\hat\mu_m &= \frac{1}{\sum_{i=1}^{n} w_i(m)} \sum_{i=1}^{n} w_i(m)\, {\bf x}_i \tag{10.10c}\\
\hat\Sigma_m &= \frac{1}{\sum_{i=1}^{n} w_i(m)} \sum_{i=1}^{n} w_i(m)\, ({\bf x}_i - \hat\mu_m)({\bf x}_i - \hat\mu_m)^T \tag{10.10d}
\end{align*}$$
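Equations 10.10 can be sketched directly in Python. The function below performs one parameter update, using hard weights for labelled points and the predicted posteriors for unlabelled ones; the function name and the `-1` convention for marking missing labels are my own choices:

```python
import numpy as np

def update_parameters(X, y, posteriors, M):
    """One semi-supervised GMM update (Equations 10.10).

    X: (n, d) data; y: length-n labels, with -1 marking unlabelled points;
    posteriors: (n, M) predicted p(y = m | x_i, theta_hat) for unlabelled rows.
    """
    n, d = X.shape
    # Equation 10.10a: hard weights for labelled points, soft for unlabelled
    W = np.zeros((n, M))
    for i in range(n):
        if y[i] == -1:
            W[i] = posteriors[i]
        else:
            W[i, y[i]] = 1.0
    pis, mus, sigmas = [], [], []
    for m in range(M):
        w = W[:, m]
        pis.append(w.sum() / n)                       # Equation 10.10b
        mu = (w[:, None] * X).sum(axis=0) / w.sum()   # Equation 10.10c
        mus.append(mu)
        diff = X - mu                                 # Equation 10.10d
        sigmas.append((w[:, None, None] *
                       np.einsum("ni,nj->nij", diff, diff)).sum(axis=0) / w.sum())
    return np.array(pis), np.array(mus), np.array(sigmas)
```

With all points labelled this reduces to the per-class MLE from the discriminant analysis section; with all labels missing it is the M step of the EM algorithm described below.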

Cluster Analysis

  • If we do not have a defined target variable (e.g., labels or numerical variables) we are in the realm of unsupervised learning.

  • The data consists of observations of features collected from some problem domain, and presumably contains structure and information that is of interest.

  • One way to learn about that structure is to build a model of the distribution of the data (i.e. probability density estimation).

  • A GMM is a flexible probability density estimator, and peaks of the density give us an understanding of the clustering of data (places in the feature space where data occurs with high probability).

  • The EM algorithm can be used to fit a GMM to data for ${\bf x}$.

    • Effectively, $y$ is marginalised out of the estimate $p({\bf x},y)$ and becomes a latent variable of the model.
    • The latent variables are $\lbrace y_i\rbrace_{i=1}^{n}$, where $y_i\in\lbrace 1,\dots,M\rbrace$ is the cluster index for data point ${\bf x}_i$.
    • This algorithm is shown in Method 10.3 below:
    • This algorithm is shown in Method 10.3 below:

Data: Unlabelled training data $\mathcal{T}=\lbrace{\bf x}_i\rbrace_{i=1}^{n}$, number of clusters $M$
Result: Gaussian Mixture Model

  1. Initialise $\hat\theta=\lbrace\hat\pi_m, \hat\mu_m, \hat\Sigma_m\rbrace_{m=1}^M$
  2. repeat
  3.  |   For each ${\bf x}_i$ in $\mathcal{T}=\lbrace{\bf x}_i\rbrace_{i=1}^{n}$, compute the prediction $p(y|{\bf x}_i, \hat\theta)$ using Equation 10.5 with the current parameter estimate $\hat\theta$.
  4.  |   Update the parameter estimates $\hat\theta\leftarrow\lbrace\hat\pi_m, \hat\mu_m, \hat\Sigma_m\rbrace_{m=1}^M$ using Equations 10.10.
  5. until convergence
  • In the “E step” (Line 3) we compute the “responsibility values” - how likely is it that a given data point was generated by each of the components of the mixture model? Note that the notation here is quite confusing: $y$ is actually a vector of values, one for each mixture component.

  • In the “M step” (Line 4) we update the model parameters $\theta$ using the responsibility values computed in the E step. Lindholm et al. refer to the responsibility values as weights, $w_i(m)$.

  • Maximising the likelihood for a GMM is a non-convex optimisation problem, and the EM algorithm performs local optimisation which hopefully converges toward a stationary point.

  • This works pretty well most of the time, though there is also a degeneracy in the problem - undesirable solutions where the likelihood diverges to infinity (e.g. a component collapsing onto a single data point with its covariance shrinking to zero).
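Method 10.3 can be sketched end-to-end as follows. The deterministic initialisation here (means at evenly spaced data points, covariances set to the global data covariance) is an illustrative choice, not prescribed by the method:

```python
import numpy as np

def fit_gmm_em(X, M, n_iter=100):
    """Fit an M-component GMM to unlabelled data with EM (a sketch of Method 10.3)."""
    n, d = X.shape
    # Step 1: initialise -- equal weights, means at evenly spaced data points,
    # covariances set to the overall data covariance (a simple deterministic choice)
    pis = np.full(M, 1.0 / M)
    mus = X[np.linspace(0, n - 1, M).astype(int)]
    sigmas = np.stack([np.cov(X, rowvar=False) for _ in range(M)])
    for _ in range(n_iter):
        # E step (Line 3): responsibilities w_i(m) ∝ pi_m * N(x_i | mu_m, Sigma_m)
        W = np.zeros((n, M))
        for m in range(M):
            diff = X - mus[m]
            mahal = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigmas[m]), diff)
            W[:, m] = pis[m] * np.exp(-0.5 * mahal) / np.sqrt(
                (2 * np.pi) ** d * np.linalg.det(sigmas[m]))
        W /= W.sum(axis=1, keepdims=True)
        # M step (Line 4): re-estimate the parameters using Equations 10.10
        Nm = W.sum(axis=0)
        pis = Nm / n
        mus = (W.T @ X) / Nm[:, None]
        for m in range(M):
            diff = X - mus[m]
            sigmas[m] = (W[:, m, None, None] *
                         np.einsum("ni,nj->nij", diff, diff)).sum(axis=0) / Nm[m]
    return pis, mus, sigmas
```

A production implementation would work in log space for numerical stability and stop when the log-likelihood stops improving, rather than running a fixed number of iterations.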

k-Means Clustering

  • We can understand the $k$-means clustering algorithm as a simplified version of fitting a GMM using the EM algorithm.
  • The simplifications made are:
    • Rather than calculating (soft, probabilistic) responsibility values, $k$-means calculates “hard” responsibility values based on the Euclidean distances between a point and a set of $k$ cluster centers.
    • The distances in the GMM, by contrast, use the covariance information (i.e. the Mahalanobis distance between each data point and each component of the GMM).
  • The algorithm is given as the following steps:
    1. Set the cluster centers $\hat\mu_1, \hat\mu_2, \cdots, \hat\mu_M$ to some initial values
    2. Determine which cluster $R_m$ each ${\bf x}_i$ belongs to by finding the cluster center $\hat\mu_m$ closest to ${\bf x}_i$, for all $i\in\lbrace 1,\dots,n\rbrace$
    3. Update each cluster center $\hat\mu_m$ as the average of all ${\bf x}_i$ that belong to $R_m$
    4. Repeat steps 2 and 3 until convergence
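The four steps above can be sketched in Python as follows (the deterministic initialisation at evenly spaced data points is an illustrative choice):

```python
import numpy as np

def kmeans(X, M, n_iter=100):
    """k-means as hard-assignment EM: Euclidean distances, no covariances."""
    # Step 1: initialise the centers at evenly spaced data points
    # (a simple deterministic choice; random restarts are common in practice)
    centers = X[np.linspace(0, len(X) - 1, M).astype(int)]
    for _ in range(n_iter):
        # Step 2: hard-assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == m].mean(axis=0) for m in range(M)])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            return new_centers, labels
        centers = new_centers
    return centers, labels
```

Compared with the EM sketch for the GMM, the soft responsibilities are replaced by a hard `argmin` over Euclidean distances, and no weights or covariance matrices are estimated.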

Choosing the Number of Clusters

  • Unfortunately no simple answer - sometimes problem domain knowledge can guide the choice.