Lindholm et al, Chapter 7

Course: Machine Learning
Semester: S1 2023

Bagging and Boosting

  • Introduce a new way of constructing models: combine multiple instances of some basic model into an “ensemble” of base models.
  • The core idea is the wisdom of crowds: by training each model in a slightly different way, each model can contribute something to learning the input-output relationship.
  • To obtain a prediction, take a (weighted) average or majority vote over the base models' predictions.
  • If the ensemble is constructed carefully, its predictions are better than those of a single base model.

Bagging

  • The more flexible a model is (i.e., the more capable it is of representing complex input-output relationships), the lower its bias will be.

    • The downside is the risk of overfitting, i.e., high model variance.
  • Bootstrap aggregating (bagging) can reduce the variance of the base model without increasing its bias.

  • Because of the noise in the training data, we may think of the prediction $\hat y(\mathbf{x}_\star)$ as a random variable.

    • In bagging, each base model is trained on a different ‘version’ of the training data, so each prediction is a different realisation of this random variable - and the average of multiple identically distributed random variables has lower variance than any single one of them.
  • That is, by taking the average, we obtain a prediction with less variance than the prediction of a single base model.

Bootstrap

The technique used to artificially create the different ‘versions’ of the training dataset.

  • The bootstrapping algorithm is as follows:

Data: Training dataset $\mathcal{T}=\lbrace \mathbf{x}_i, y_i\rbrace_{i=1}^{n}$

Result: Bootstrapped data $\tilde{\mathcal{T}}=\lbrace \tilde{\mathbf{x}}_i, \tilde{y}_i\rbrace_{i=1}^{n}$

  1. For $i=1,\dots,n$ do
  2.  |   Sample $\ell$ uniformly on the set of integers $\lbrace 1,\dots,n\rbrace$
  3.  |   Set $\tilde{\mathbf{x}}_i=\mathbf{x}_\ell$ and $\tilde{y}_i=y_\ell$
  4. end
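The resampling loop above can be sketched in a few lines of Python (a minimal sketch; the helper name `bootstrap` and the toy data are our own, not from the book):

```python
import random

def bootstrap(dataset, rng=random):
    """Return a bootstrapped 'version' of dataset: n draws with replacement."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

# Each call produces a different resampled version of the training data.
data = [(x, 2 * x) for x in range(10)]   # toy (x, y) pairs
random.seed(0)
resampled = bootstrap(data)
```

Because sampling is with replacement, some points appear several times in `resampled` and others not at all - that is what makes each version different.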

Variance Reduction by Averaging

  • The bootstrap is traditionally used for quantifying uncertainty in statistical estimators.

  • This is done by creating subsamples of the original dataset, by sampling with replacement.

  • By running the bootstrapping algorithm $B$ times, we obtain $B$ random but identically distributed bootstrapped datasets $\widetilde{\mathcal{T}}^{(1)}, \dots, \widetilde{\mathcal{T}}^{(B)}$

  • The bootstrapped datasets can then be used to train an ensemble of $B$ base models.

    • $\hat y_\text{bag}$ refers to the regression prediction
    • $\mathbf{g}_\text{bag}$ refers to the classification prediction (assuming the base classifiers output class probabilities)
    • If the base classifiers output only hard class labels rather than probabilities, we can instead take a majority vote amongst them.
  • The key observation is that averaging random variables reduces the variance.

  • Consider a collection of random variables $z_1,\dots,z_B$.

    • Let the mean and variance of each variable be $\mu$ and $\sigma^2$
    • Let the average correlation between pairs be $\rho$
  • The mean and variance of their average (e.g., of the ensemble output) are given by Equations 7.2a and 7.2b

$$\mathbb{E}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right]=\mu\tag{7.2a}$$

$$\text{Var}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right]=\frac{1-\rho}{B}\sigma^2 + \rho\sigma^2\tag{7.2b}$$

  • We can see that:
    • The mean is unaffected
    • The variance decreases as $B$ increases, but never below the floor $\rho\sigma^2$
    • The less correlated the variables are, the smaller the remaining variance
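Equation 7.2b can be checked numerically; the sketch below (our own helper, not from the book) evaluates it for growing $B$ to show the variance floor $\rho\sigma^2$:

```python
def ensemble_variance(B, sigma2, rho):
    """Variance of the average of B identically distributed variables with
    variance sigma2 and average pairwise correlation rho (Eq. 7.2b)."""
    return (1 - rho) / B * sigma2 + rho * sigma2

sigma2, rho = 1.0, 0.3
variances = [ensemble_variance(B, sigma2, rho) for B in (1, 10, 100, 1000)]
# The variance shrinks as B grows but never drops below rho * sigma2 = 0.3.
# That floor is why decorrelating the members (as random forests do) can
# help beyond what plain averaging achieves.
```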

$$\hat{y}_\text{bag}(\mathbf{x}_\star)=\frac{1}{B}\sum_{b=1}^{B} \tilde{y}^{(b)}(\mathbf{x}_\star) \qquad\text{or}\qquad \mathbf{g}_\text{bag}(\mathbf{x}_\star)=\frac{1}{B}\sum_{b=1}^{B}\tilde{\mathbf{g}}^{(b)}(\mathbf{x}_\star) \tag{7.1}$$

The training of the base models is given as pseudocode in Method 7.1 below.

Data: Training set $\mathcal{T} = \lbrace \mathbf{x}_i, y_i\rbrace_{i=1}^{n}$

Result: $B$ base models

  1. For $b=1,\dots,B$ do
  2.  |   Run Algorithm 7.1 to obtain a bootstrapped training dataset $\widetilde{\mathcal{T}}^{(b)}$
  3.  |   Learn a base model from $\widetilde{\mathcal{T}}^{(b)}$
  4. end
  5. Obtain $\hat y_{\text{bag}}$ or $\mathbf{g}_\text{bag}$ by averaging as in Equation 7.1.

The prediction using the base models is given as:

Data: $B$ base models and a test input $\mathbf{x}_\star$

Result: A prediction $\hat y_\text{bag}(\mathbf{x}_\star)$ or $\mathbf{g}_\text{bag}(\mathbf{x}_\star)$

  1. For $b=1,\dots,B$ do
  2.  |   Use base model $b$ to predict $\widetilde{y}^{(b)}(\mathbf{x}_\star)$ or $\widetilde{\mathbf{g}}^{(b)}(\mathbf{x}_\star)$
  3. end
  4. Obtain $\hat y_{\text{bag}}$ or $\mathbf{g}_\text{bag}$ by averaging as in Equation 7.1.
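The training and prediction procedures above can be sketched together in plain Python. Everything here (the `fit_1nn` base model, the `fit_bagged` helper, the toy data) is our own illustration, not the book's code:

```python
import random

def fit_1nn(train):
    """1-NN regressor: low bias, high variance - the case bagging targets."""
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return predict

def fit_bagged(train, B, fit_base, seed=0):
    """Train B base models, each on its own bootstrapped dataset, and
    predict with the plain average of Equation 7.1."""
    rng = random.Random(seed)
    n = len(train)
    models = [fit_base([train[rng.randrange(n)] for _ in range(n)])
              for _ in range(B)]
    return lambda x: sum(m(x) for m in models) / B

train = [(x / 10, (x / 10) ** 2) for x in range(11)]  # toy data, y = x^2
model = fit_bagged(train, B=25, fit_base=fit_1nn)
```

A 1-NN regressor is used as the base model precisely because it has low bias but high variance - the situation in which averaging over bootstrapped models is expected to pay off.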

  • Despite the total number of parameters in the model increasing as $B$ increases, the lack of overfitting as $B\rightarrow\infty$ is expected and is the intended behaviour.
    • It is important to understand that more ensemble members does not make the resulting model more flexible - it only reduces the variance.
    • If each ensemble member overfits to its individual bootstrapped version of the training data, the averaging across all ensemble members will result in a larger training error than a single member would have, but better generalisation.

Out-of-Bag Error Estimation

  • When using bagging or random forests, we can estimate $E_\text{new}$ without using cross-validation
  • When using bootstrapping, on average only about 63% of the original training points will be present in a given bootstrapped training set.
    • This means that approximately $B/3$ of the ensemble members have not seen a given training point - these members are “out-of-bag” for that point.
    • Such a point can act as a test point, since it has not been seen by any of these ensemble members.
    • By computing the error of the out-of-bag ensemble members on point $i$, we obtain an estimate of $E_\text{new}$ for this ensemble, denoted $E_\text{OOB}^{(i)}$.
    • A single point gives a fairly poor estimate of $E_\text{new}$, but averaging over all points gives a better estimate.
    • If $B$ is large enough that ensembles with $B$ and $B/3$ members perform similarly, then $E_\text{OOB}$ will be a good estimate of $E_\text{new}$ (at least as good as $E_\text{k-fold}$)
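The 63% figure can be verified both analytically and by simulation (a quick sketch of our own, not from the book):

```python
import random

# The probability that a given training point appears in a bootstrap sample
# of size n is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows.
def inclusion_probability(n):
    return 1 - (1 - 1 / n) ** n

p = inclusion_probability(1000)

# Simulated check: average fraction of unique points per bootstrapped set.
random.seed(0)
n, trials = 200, 500
unique_fraction = sum(
    len({random.randrange(n) for _ in range(n)}) / n for _ in range(trials)
) / trials
```

The complement, roughly $1/e \approx 37\%$, is the expected fraction of ensemble members that are out-of-bag for any given point - hence the "approximately $B/3$" above.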

Random Forests

  • Bagging reduces variance by averaging over an ensemble of models.

  • The variance reduction is limited by the correlation between individual ensemble members.

  • However, we can reduce the correlation beyond what is possible with bagging alone, which results in the random forest technique.

  • Whilst bagging is a general technique for any base model, random forests assume that the base models are regression or classification trees.

    • The idea is to inject additional randomness to reduce the correlation amongst the base models.
  • Let $\widetilde{\mathcal{T}}^{(b)}$ be one of the $B$ bootstrapped datasets in bagging.

    • To train a classification or regression tree on it, perform training as usual with one key difference
    • Whenever about to split a node, do not consider all $p$ possible input variables - choose a random subset of $q\le p$ inputs and consider only these $q$ variables as possible splitting variables.
  • This makes the trees less correlated, and can therefore result in a larger variance reduction compared to bagging.

    • It is important to note that this perturbation (random selection of features to split on) increases the variance of each individual ensemble member.
  • To understand why it can be a good idea to consider only a subset of inputs as splitting variables, recall that tree-building is based on recursive binary splitting, which is a greedy algorithm.

    • If we construct an ensemble of trees using plain bagging, it is very likely that all trees will split on the same strong input variable(s), making them highly correlated.
  • With a random forest, some ensemble members are not allowed to split on these inputs and are forced to split on other inputs instead.

    • This can prove useful further down the splitting process, and may result in better average performance.

  • Since random forests are a bagging method, techniques such as out-of-bag error estimation can also be used.
  • As with bagging, taking $B\rightarrow\infty$ doesn't lead to overfitting - the only reason to choose a smaller value of $B$ is to reduce the computational cost.
  • Furthermore, since all trees are trained independently of one another, random forest learning can be parallelised.
  • The choice of $q$ is a hyperparameter; if $q=p$, we recover the bagging method described previously.
    • Rule of thumb: $q=\lfloor\sqrt p\rfloor$ for classification and $q=\lfloor p/3\rfloor$ for regression.
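The split-time feature subsampling and the rule of thumb can be sketched as follows (the helper name `split_candidates` is our own, not a library API):

```python
import math
import random

def split_candidates(p, task="classification", rng=random):
    """Random forest's extra randomness: at each split, consider only a
    random subset of q of the p input variables (rule-of-thumb sizes)."""
    q = max(1, math.isqrt(p)) if task == "classification" else max(1, p // 3)
    return rng.sample(range(p), q)

random.seed(0)
features = split_candidates(p=16)   # a fresh subset is drawn at every split
```

A new subset is drawn at every node of every tree, so no single strong variable can dominate all the trees in the forest.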

Boosting and AdaBoost

  • Boosting is another ensemble method used for reducing bias in high-bias base models like classification trees of depth 1.

  • Built on the idea that even a weak, high-bias model can capture some of the relationship between input and output.

  • By training multiple weak models, each describing part of the input-output relationship, it might be possible to combine their predictions into an overall better prediction.

    • Reduce the bias by turning an ensemble of weak models into one strong model.
  • In bagging, we train $B$ identically distributed models in parallel.

  • Boosting instead trains the models sequentially, in such a way that each model tries to correct the mistakes made by the previous ones.

    • The training dataset is modified at each iteration to put more emphasis on the data points that the models so far have performed poorly on.
  • The final prediction is obtained by taking a weighted average or weighted majority vote over the models, where each model's weight reflects its performance during training.

AdaBoost

The first successful implementation of a boosting algorithm.

  • The general idea is to construct a sequence of $B$ weak binary classifiers $\hat y^{(1)}(\mathbf{x}), \hat y^{(2)}(\mathbf{x}), \dots, \hat y^{(B)}(\mathbf{x})$.
  • For this procedure, we only consider the final ‘hard’ predictions from the base models, not predicted class probabilities.
  • The ensemble members are not treated equally - each is assigned a positive coefficient $\alpha^{(b)}$, and the boosted classifier is constructed using a weighted majority vote.

$$\hat y_\text{boost}^{(B)}=\text{sign}\left\lbrace \sum_{b=1}^{B} \alpha^{(b)} \hat y^{(b)}(\mathbf{x})\right\rbrace\tag{7.4}$$

  • Each ensemble member votes either $-1$ or $+1$, and the output from the boosted classifier is $+1$ if the weighted sum of the individual votes is positive and $-1$ if it is negative.
    • The coefficient $\alpha^{(b)}$ can be thought of as a degree of confidence in the predictions made by the $b$th ensemble member.
  • In AdaBoost, the ensemble members and their coefficients $\alpha^{(b)}$ are trained greedily, by minimising the exponential loss of the boosted classifier at each iteration.
  • The exponential loss function is given as Equation 5.15 and duplicated below, where $y\cdot f(\mathbf{x})$ is the margin of the classifier.

$$L(y \cdot f(\mathbf{x})) = \exp(-y \cdot f(\mathbf{x}))\tag{7.5}$$

  • The ensemble members are added one at a time, and when member $b$ is added, this is done to minimise the exponential loss of the entire ensemble constructed so far.
  • The exponential loss function is chosen because it results in convenient closed-form expressions.
  • The boosted classifier after $b$ iterations is given as $\hat y_\text{boost}^{(b)}(\mathbf{x})=\text{sign}\lbrace f^{(b)}(\mathbf{x})\rbrace$ where $f^{(b)}(\mathbf{x})=\sum_{j=1}^{b} \alpha^{(j)} \hat y^{(j)}(\mathbf{x})$.
  • $f^{(b)}(\mathbf{x})$ can be expressed iteratively as

$$f^{(b)}(\mathbf{x})=f^{(b-1)}(\mathbf{x})+\alpha^{(b)} \hat y^{(b)}(\mathbf{x})\tag{7.6}$$

  • The ensemble members are constructed sequentially, which means that at iteration $b$, the function $f^{(b-1)}(\mathbf{x})$ is fixed.
    • Therefore, in iteration $b$, we must learn the ensemble member $\hat y^{(b)}(\mathbf{x})$ and its coefficient $\alpha^{(b)}$.

$$\begin{align*} (\alpha^{(b)}, \hat y^{(b)}) &= \arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} L\left(y_i \cdot f^{(b)}(\mathbf{x}_i)\right) \tag{7.7a}\\ &= \arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} \exp \left(-y_i \left(f^{(b-1)}(\mathbf{x}_i) + \alpha \hat y(\mathbf{x}_i)\right)\right) \tag{7.7b}\\ &=\arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} \underbrace{\exp \left(-y_i f^{(b-1)}(\mathbf{x}_i)\right)}_{=\,w_i^{(b)}} \exp \left(-y_i \alpha \hat y(\mathbf{x}_i)\right) \tag{7.7c} \end{align*}$$

In the equation above:

  1. The first equality uses the definition of the exponential loss and the sequential structure of the boosted classifier (Equation 7.6)
  2. The second uses the property $\exp(a+b)=\exp(a)\exp(b)$ to factor out a quantity that can be interpreted as an individual weight for each data point in the training dataset, defined in Equation 7.8
  • Note that these weights are independent of $\alpha$ and $\hat y$ - that is, when learning $\hat y^{(b)}(\mathbf{x})$ and its coefficient $\alpha^{(b)}$ (by solving Equation 7.7c), we can regard the weights $w_i^{(b)}$ as constants.

$$w_i^{(b)} \overset{\text{def}}{=} \exp\left(-y_i f^{(b-1)}(\mathbf{x}_i)\right) \tag{7.8}$$

At this point, the book continues the AdaBoost training derivation - I don't think this is particularly relevant here.

  • We finally arrive at the following optimisation problem.

$$\hat y^{(b)}=\arg\min_{\hat y} \sum_{i=1}^{n} w_i^{(b)} \mathbb{I}(y_i \neq \hat y(\mathbf{x}_i)) \tag{7.9}$$

  • That is, the $b$th ensemble member should be trained by minimising the weighted misclassification loss, where each data point is assigned a weight $w_i^{(b)}$.
  • The weights force the model to focus on the mistakes (previous misclassifications) made by the ensemble of the first $b-1$ classifiers.

In practice the way that the optimisation problem is solved depends on the choice of base classifier.

  • Solving Equation 7.9 is almost identical to solving the standard classification problem, the only difference being that the training dataset is weighted.

  • For most base classifiers, training an ensemble member on a weighted classification problem is straightforward.

  • Since most classifiers are trained by minimising some cost function, the problem reduces to weighting the individual terms of the cost function and solving that modified problem.

    • That is, we use a weighted version of the loss function.
The AdaBoost training algorithm is given as:

Data: Training data $\mathcal{T}=\lbrace \mathbf{x}_i, y_i\rbrace_{i=1}^{n}$

Result: BB weak classifiers

  1. Assign weights $w_i^{(1)}=\frac{1}{n}$ to all data points
  2. For $b=1,\dots,B$ do
  3.  |    Train a weak classifier $\hat{y}^{(b)}(\mathbf{x})$ on the weighted training data $\lbrace(\mathbf{x}_i, y_i, w_i^{(b)})\rbrace_{i=1}^{n}$
  4.  |    Compute $E_\text{train}^{(b)}=\sum_{i=1}^{n} w_i^{(b)} \mathbb{I}\lbrace y_i\ne\hat{y}^{(b)}(\mathbf{x}_i)\rbrace$
  5.  |    Compute $\alpha^{(b)}=0.5\ln((1-E_\text{train}^{(b)})/E_\text{train}^{(b)})$
  6.  |    Compute $w_i^{(b+1)}=w_i^{(b)} \exp(-\alpha^{(b)} y_i \hat{y}^{(b)}(\mathbf{x}_i))$, $i=1,\dots,n$
  7.  |    Set $w_i^{(b+1)}\leftarrow w_i^{(b+1)} / \sum_{j=1}^{n} w_j^{(b+1)}$ (normalisation of weights)
  8. end

From lines 5 and 6 of the AdaBoost algorithm, we can draw the following conclusions:

  • AdaBoost trains the ensemble by minimising an exponential loss function of the boosted classifier at each iteration - the loss function is shown in Equation 7.5 below.

    $$L(y \cdot f(\mathbf{x})) = \exp(-y \cdot f(\mathbf{x})) \tag{7.5}$$

    • The fact that this is an exponential function makes the math work out nicely.
  • Part of the derivation shows that we can do the optimisation using the weighted misclassification loss at each iteration (Line 4).

The AdaBoost prediction algorithm is as follows:

Data: $B$ weak classifiers with confidence values $\lbrace\hat{y}^{(b)}(\mathbf{x}),\alpha^{(b)}\rbrace_{b=1}^{B}$ and a test input $\mathbf{x}_\star$

Result: Prediction $\hat{y}_\text{boost}^{(B)}(\mathbf{x}_\star)$

  1. Output $\hat{y}_\text{boost}^{(B)}(\mathbf{x}_\star)=\text{sign}\lbrace\sum_{b=1}^{B}\alpha^{(b)}\hat{y}^{(b)}(\mathbf{x}_\star)\rbrace$
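The training and prediction algorithms can be combined into a minimal pure-Python AdaBoost, using depth-1 decision stumps as the weak classifiers. All names (`fit_stump`, `fit_adaboost`) and the toy dataset are our own illustration, not the book's code:

```python
import math

def fit_stump(X, y, w):
    """Weighted decision stump on 1-D inputs: predict s if x > t, else -s."""
    best = None
    for t in X:
        for s in (-1, 1):
            pred = [s if x > t else -s for x in X]
            err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
            if best is None or err < best[0]:
                best = (err, t, s)
    err, t, s = best
    return (lambda x: s if x > t else -s), err

def fit_adaboost(X, y, B):
    """Lines 1-8 of the training algorithm: reweight, fit, score, repeat."""
    n = len(X)
    w = [1 / n] * n
    ensemble = []
    for _ in range(B):
        clf, err = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard the logarithm
        alpha = 0.5 * math.log((1 - err) / err)    # line 5
        w = [wi * math.exp(-alpha * yi * clf(xi))  # line 6
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]               # line 7: normalise
        ensemble.append((alpha, clf))
    def predict(x):                                # Eq. 7.4: weighted vote
        return 1 if sum(a * c(x) for a, c in ensemble) > 0 else -1
    return predict

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [-1, -1, 1, 1, 1, -1]    # not separable by any single stump
clf = fit_adaboost(X, y, B=10)
```

No single stump can classify this label pattern, but the weighted vote over several stumps with different thresholds can, which illustrates how boosting reduces the bias of a weak base model.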

Gradient Boosting

  • AdaBoost performs well if there is little noise in the data.
    • Performance deteriorates if there is a lot of noise in the data (because of the exponential loss function).
  • A more robust loss function can be used instead, at the cost of more computation.
  • Instead of saying that we learn a sequence of weak classifiers, where each classifier tries to correct the mistakes made by the previous ones, we can say that we construct an “additive model” of the form:

$$f^{(B)}(\mathbf{x}) = \sum_{b=1}^{B} \alpha^{(b)} f^{(b)}(\mathbf{x}) \tag{7.14}$$

  • In this equation, the $\alpha^{(b)}$ are real-valued coefficients and the $f^{(b)}(\mathbf{x})$ are basis functions.

    • For regression, $f^{(B)}(\mathbf{x})$ can be used directly as the model's prediction
    • For classification, it can be thresholded by a sign function, or transformed into class probabilities using a logistic function.
  • If the basis functions are fixed, then we only need to learn the $\alpha$ coefficients.

    • However, we obtain a more flexible model by allowing the basis functions themselves to be learned, just as in a neural network.
      • In fact, a two-layer regression neural network is an additive model.
  • There are two properties that distinguish boosting from other additive models

    • The basis functions are learned from data - each corresponds to a machine learning model itself, the base model of the boosting procedure
    • The basis functions and coefficients are learned sequentially - each iteration adds one component to the sum in Equation 7.14, and after $B$ iterations, the algorithm terminates.
  • The goal when training an additive model is to select $\lbrace\alpha^{(b)}, f^{(b)}(\mathbf{x})\rbrace_{b=1}^{B}$ such that the final model minimises the following cost function:

$$J(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i)) \tag{7.15}$$
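A minimal sketch of gradient boosting for regression with the squared-error loss: each basis function is a regression stump fitted to the current residuals, which are the negative gradients of the loss with respect to the current predictions. All names and data below are our own illustration, not the book's code:

```python
def fit_stump(X, r):
    """Least-squares regression stump: split at t, predict each side's mean."""
    best = None
    for t in X:
        left = [ri for xi, ri in zip(X, r) if xi <= t]
        right = [ri for xi, ri in zip(X, r) if xi > t]
        if not right:                 # degenerate threshold: no split, skip
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - (ml if xi <= t else mr)) ** 2
                  for xi, ri in zip(X, r))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_gradient_boost(X, y, B, lr=0.5):
    """Greedy additive model (Eq. 7.14): each new basis function is fitted
    to the current residuals and added with a fixed shrinkage factor lr."""
    f0 = sum(y) / len(y)              # constant initial model
    pred = [f0] * len(X)
    models = []
    for _ in range(B):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        models.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, X)]
    return lambda x: f0 + sum(lr * m(x) for m in models)

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]  # y = x^2 on a small grid
model = fit_gradient_boost(X, y, B=30)
```

The shrinkage factor `lr` plays the role of the coefficients $\alpha^{(b)}$, here fixed to a constant for simplicity rather than optimised at each iteration.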