Lindholm et al, Chapter 7

Course: Machine Learning
Semester: S1 2023

Bagging and Boosting

  • Introduce a new way of constructing models: combine multiple instances of some basic model into an “ensemble” of base models.
  • The core idea is the wisdom of crowds: by training each model in a slightly different way, each model can contribute something to learning the input-output relationship.
  • To obtain a prediction, take a (weighted) average or majority vote over the base models' predictions.
  • If the ensemble is constructed carefully, its predictions are better than those of a single base model.

Bagging

  • The more flexible a model is (i.e., the more capable it is of representing complex input-output relationships), the lower its bias will be.

    • The downside is the risk of overfitting, i.e., high model variance.
  • Bootstrap aggregating (bagging) can reduce the variance of the base model without increasing its bias.

  • Because of the noise in the training data, we may think of the prediction $\hat y(\mathbf{x}_\star)$ as a random variable.

    • In bagging, each base model is trained on a different ‘version’ of the training data, so each prediction is a different realisation of this random variable - and the average of multiple identically distributed random variables has lower variance than any single one of them.
  • That is, by taking the average, we obtain a prediction with less variance than the prediction of a single base model.

Bootstrap

The technique used to artificially create the different ‘versions’ of the training dataset.

  • The bootstrapping algorithm is as follows:

Data: Training dataset $\mathcal{T}=\lbrace \mathbf{x}_i, y_i\rbrace_{i=1}^{n}$

Result: Bootstrapped data $\tilde{\mathcal{T}}=\lbrace \tilde{\mathbf{x}}_i, \tilde{y}_i\rbrace_{i=1}^{n}$

  1. For $i=1,\dots,n$ do
  2.  |   Sample $\ell$ uniformly on the set of integers $\lbrace 1,\dots,n\rbrace$
  3.  |   Set $\tilde{\mathbf{x}}_i=\mathbf{x}_\ell$ and $\tilde{y}_i=y_\ell$
  4. end
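The resampling loop above can be sketched in a few lines of Python (a minimal sketch; the helper name `bootstrap` and the toy data are our own, not from the book):

```python
import random

def bootstrap(dataset, rng=random):
    """Return a bootstrapped 'version' of dataset: n draws with replacement."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

# Each call produces a different resampled version of the training data.
data = [(x, 2 * x) for x in range(10)]   # toy (x, y) pairs
random.seed(0)
resampled = bootstrap(data)
```

Because sampling is with replacement, some points appear several times in `resampled` and others not at all - that is what makes each version different.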

Variance Reduction by Averaging

  • The bootstrap is traditionally used for quantifying uncertainty in statistical estimators.

  • This is done by creating subsamples of the original dataset, by sampling with replacement.

  • By running the bootstrapping algorithm $B$ times, we obtain $B$ random but identically distributed bootstrapped datasets $\widetilde{\mathcal{T}}^{(1)}, \dots, \widetilde{\mathcal{T}}^{(B)}$

  • The bootstrapped datasets can then be used to train an ensemble of $B$ base models.

    • $\hat y_\text{bag}$ refers to the regression prediction
    • $\mathbf{g}_\text{bag}$ refers to the classification prediction (assuming the base classifiers output class probabilities)
    • If the base classifiers output only hard class labels rather than probabilities, we can instead take a majority vote amongst them.
  • The key observation is that averaging random variables reduces the variance.

  • Consider a collection of random variables $z_1,\dots,z_B$.

    • Let the mean and variance of each variable be $\mu$ and $\sigma^2$
    • Let the average correlation between pairs be $\rho$
  • The mean and variance of their average (e.g., of the ensemble output) are given by Equations 7.2a and 7.2b

$$\mathbb{E}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right]=\mu\tag{7.2a}$$

$$\text{Var}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right]=\frac{1-\rho}{B}\sigma^2 + \rho\sigma^2\tag{7.2b}$$

  • We can see that:
    • The mean is unaffected
    • The variance decreases as $B$ increases, but never below the floor $\rho\sigma^2$
    • The less correlated the variables are, the smaller the remaining variance
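Equation 7.2b can be checked numerically; the sketch below (our own helper, not from the book) evaluates it for growing $B$ to show the variance floor $\rho\sigma^2$:

```python
def ensemble_variance(B, sigma2, rho):
    """Variance of the average of B identically distributed variables with
    variance sigma2 and average pairwise correlation rho (Eq. 7.2b)."""
    return (1 - rho) / B * sigma2 + rho * sigma2

sigma2, rho = 1.0, 0.3
variances = [ensemble_variance(B, sigma2, rho) for B in (1, 10, 100, 1000)]
# The variance shrinks as B grows but never drops below rho * sigma2 = 0.3.
# That floor is why decorrelating the members (as random forests do) can
# help beyond what plain averaging achieves.
```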

$$\hat{y}_\text{bag}(\mathbf{x}_\star)=\frac{1}{B}\sum_{b=1}^{B} \tilde{y}^{(b)}(\mathbf{x}_\star) \qquad\text{or}\qquad \mathbf{g}_\text{bag}(\mathbf{x}_\star)=\frac{1}{B}\sum_{b=1}^{B}\tilde{\mathbf{g}}^{(b)}(\mathbf{x}_\star) \tag{7.1}$$

The training of the base models is given as pseudocode in Method 7.1 below.

Data: Training set $\mathcal{T} = \lbrace \mathbf{x}_i, y_i\rbrace_{i=1}^{n}$

Result: $B$ base models

  1. For $b=1,\dots,B$ do
  2.  |   Run Algorithm 7.1 to obtain a bootstrapped training dataset $\widetilde{\mathcal{T}}^{(b)}$
  3.  |   Learn a base model from $\widetilde{\mathcal{T}}^{(b)}$
  4. end
  5. Obtain $\hat y_{\text{bag}}$ or $\mathbf{g}_\text{bag}$ by averaging as in Equation 7.1.

The prediction using the base models is given as:

Data: $B$ base models and a test input $\mathbf{x}_\star$

Result: A prediction $\hat y_\text{bag}(\mathbf{x}_\star)$ or $\mathbf{g}_\text{bag}(\mathbf{x}_\star)$

  1. For $b=1,\dots,B$ do
  2.  |   Use base model $b$ to predict $\widetilde{y}^{(b)}(\mathbf{x}_\star)$ or $\widetilde{\mathbf{g}}^{(b)}(\mathbf{x}_\star)$
  3. end
  4. Obtain $\hat y_{\text{bag}}$ or $\mathbf{g}_\text{bag}$ by averaging as in Equation 7.1.
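The training and prediction procedures above can be sketched together in plain Python. Everything here (the `fit_1nn` base model, the `fit_bagged` helper, the toy data) is our own illustration, not the book's code:

```python
import random

def fit_1nn(train):
    """1-NN regressor: low bias, high variance - the case bagging targets."""
    def predict(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return predict

def fit_bagged(train, B, fit_base, seed=0):
    """Train B base models, each on its own bootstrapped dataset, and
    predict with the plain average of Equation 7.1."""
    rng = random.Random(seed)
    n = len(train)
    models = [fit_base([train[rng.randrange(n)] for _ in range(n)])
              for _ in range(B)]
    return lambda x: sum(m(x) for m in models) / B

train = [(x / 10, (x / 10) ** 2) for x in range(11)]  # toy data, y = x^2
model = fit_bagged(train, B=25, fit_base=fit_1nn)
```

A 1-NN regressor is used as the base model precisely because it has low bias but high variance - the situation in which averaging over bootstrapped models is expected to pay off.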

  • Despite the total number of parameters in the model increasing as $B$ increases, the lack of overfitting as $B\rightarrow\infty$ is expected and is the intended behaviour.
    • It is important to understand that more ensemble members does not make the resulting model more flexible - it only reduces the variance.
    • If each ensemble member overfits to its individual bootstrapped version of the training data, the averaging across all ensemble members will result in a larger training error than a single member would have, but better generalisation.

Out-of-Bag Error Estimation

  • When using bagging or random forests, we can estimate $E_\text{new}$ without using cross-validation
  • When using bootstrapping, on average only about 63% of the original training points will be present in a given bootstrapped training set.
    • This means that approximately $B/3$ of the ensemble members have not seen a given training point - these members are “out-of-bag” for that point.
    • Such a point can act as a test point, since it has not been seen by any of these ensemble members.
    • By computing the error of the out-of-bag ensemble members on point $i$, we obtain an estimate of $E_\text{new}$ for this ensemble, denoted $E_\text{OOB}^{(i)}$.
    • A single point gives a fairly poor estimate of $E_\text{new}$, but averaging over all points gives a better estimate.
    • If $B$ is large enough that ensembles with $B$ and $B/3$ members perform similarly, then $E_\text{OOB}$ will be a good estimate of $E_\text{new}$ (at least as good as $E_\text{k-fold}$)
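The 63% figure can be verified both analytically and by simulation (a quick sketch of our own, not from the book):

```python
import random

# The probability that a given training point appears in a bootstrap sample
# of size n is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows.
def inclusion_probability(n):
    return 1 - (1 - 1 / n) ** n

p = inclusion_probability(1000)

# Simulated check: average fraction of unique points per bootstrapped set.
random.seed(0)
n, trials = 200, 500
unique_fraction = sum(
    len({random.randrange(n) for _ in range(n)}) / n for _ in range(trials)
) / trials
```

The complement, roughly $1/e \approx 37\%$, is the expected fraction of ensemble members that are out-of-bag for any given point - hence the "approximately $B/3$" above.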

Random Forests

  • Bagging reduces variance by averaging over an ensemble of models.

  • The variance reduction is limited by the correlation between individual ensemble members.

  • However, we can reduce the correlation beyond what is possible with bagging alone, which results in the random forest technique.

  • Whilst bagging is a general technique for any base model, random forests assume that the base models are regression or classification trees.

    • The idea is to inject additional randomness to reduce the correlation amongst the base models.
  • Let $\widetilde{\mathcal{T}}^{(b)}$ be one of the $B$ bootstrapped datasets in bagging.

    • To train a classification or regression tree on it, perform training as usual with one key difference
    • Whenever about to split a node, do not consider all $p$ possible input variables - choose a random subset of $q\le p$ inputs and consider only these $q$ variables as possible splitting variables.
  • This makes the trees less correlated, and can therefore result in a larger variance reduction compared to bagging.

    • It is important to note that this perturbation (random selection of features to split on) increases the variance of each individual ensemble member.
  • To understand why it can be a good idea to consider only a subset of inputs as splitting variables, recall that tree-building is based on recursive binary splitting, which is a greedy algorithm.

    • If we construct an ensemble of trees using plain bagging, it is very likely that all trees will split on the same strong input variable(s), making them highly correlated.
  • With a random forest, some ensemble members are not allowed to split on these inputs and are forced to split on other inputs instead.

    • This can prove useful further down the splitting process, and may result in better average performance.

  • Since random forests are a bagging method, techniques such as out-of-bag error estimation can also be used.
  • As with bagging, taking $B\rightarrow\infty$ doesn't lead to overfitting - the only reason to choose a smaller value of $B$ is to reduce the computational cost.
  • Furthermore, since all trees are trained independently of one another, random forest learning can be parallelised.
  • The choice of $q$ is a hyperparameter; if $q=p$, we recover the bagging method described previously.
    • Rule of thumb: $q=\lfloor\sqrt p\rfloor$ for classification and $q=\lfloor p/3\rfloor$ for regression.
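The split-time feature subsampling and the rule of thumb can be sketched as follows (the helper name `split_candidates` is our own, not a library API):

```python
import math
import random

def split_candidates(p, task="classification", rng=random):
    """Random forest's extra randomness: at each split, consider only a
    random subset of q of the p input variables (rule-of-thumb sizes)."""
    q = max(1, math.isqrt(p)) if task == "classification" else max(1, p // 3)
    return rng.sample(range(p), q)

random.seed(0)
features = split_candidates(p=16)   # a fresh subset is drawn at every split
```

A new subset is drawn at every node of every tree, so no single strong variable can dominate all the trees in the forest.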

Boosting and AdaBoost

  • Boosting is another ensemble method used for reducing bias in high-bias base models like classification trees of depth 1.

  • Built on the idea that even a weak, high-bias model can capture some of the relationship between input and output.

  • By training multiple weak models, each describing part of the input-output relationship, it might be possible to combine their predictions into an overall better prediction.

    • Reduce the bias by turning an ensemble of weak models into one strong model.
  • In bagging, we train $B$ identically distributed models in parallel.

  • Boosting instead trains the models sequentially, in such a way that each model tries to correct the mistakes made by the previous ones.

    • The training dataset is modified at each iteration to put more emphasis on the data points that the models so far have performed poorly on.
  • The final prediction is obtained by taking a weighted average or weighted majority vote over the models, where each model's weight reflects its performance during training.

AdaBoost

The first successful implementation of a boosting algorithm.

  • The general idea is to construct a sequence of $B$ weak binary classifiers $\hat y^{(1)}(\mathbf{x}), \hat y^{(2)}(\mathbf{x}), \dots, \hat y^{(B)}(\mathbf{x})$.
  • For this procedure, we only consider the final ‘hard’ predictions from the base models, not predicted class probabilities.
  • The ensemble members are not treated equally - each is assigned a positive coefficient $\alpha^{(b)}$, and the boosted classifier is constructed using a weighted majority vote.

$$\hat y_\text{boost}^{(B)}=\text{sign}\left\lbrace \sum_{b=1}^{B} \alpha^{(b)} \hat y^{(b)}(\mathbf{x})\right\rbrace\tag{7.4}$$

  • Each ensemble member votes either $-1$ or $+1$, and the output from the boosted classifier is $+1$ if the weighted sum of the individual votes is positive and $-1$ if it is negative.
    • The coefficient $\alpha^{(b)}$ can be thought of as a degree of confidence in the predictions made by the $b$th ensemble member.
  • In AdaBoost, the ensemble members and their coefficients $\alpha^{(b)}$ are trained greedily, by minimising the exponential loss of the boosted classifier at each iteration.
  • The exponential loss function is given as Equation 5.15 and duplicated below, where $y\cdot f(\mathbf{x})$ is the margin of the classifier.

$$L(y \cdot f(\mathbf{x})) = \exp(-y \cdot f(\mathbf{x}))\tag{7.5}$$

  • The ensemble members are added one at a time, and when member $b$ is added, this is done to minimise the exponential loss of the entire ensemble constructed so far.
  • The exponential loss function is chosen because it results in convenient closed-form expressions.
  • The boosted classifier after $b$ iterations is given as $\hat y_\text{boost}^{(b)}(\mathbf{x})=\text{sign}\lbrace f^{(b)}(\mathbf{x})\rbrace$ where $f^{(b)}(\mathbf{x})=\sum_{j=1}^{b} \alpha^{(j)} \hat y^{(j)}(\mathbf{x})$.
  • $f^{(b)}(\mathbf{x})$ can be expressed iteratively as

$$f^{(b)}(\mathbf{x})=f^{(b-1)}(\mathbf{x})+\alpha^{(b)} \hat y^{(b)}(\mathbf{x})\tag{7.6}$$

  • The ensemble members are constructed sequentially, which means that at iteration $b$, the function $f^{(b-1)}(\mathbf{x})$ is fixed.
    • Therefore, in iteration $b$, we must learn the ensemble member $\hat y^{(b)}(\mathbf{x})$ and its coefficient $\alpha^{(b)}$.

$$\begin{align*} (\alpha^{(b)}, \hat y^{(b)}) &= \arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} L\left(y_i \cdot f^{(b)}(\mathbf{x}_i)\right) \tag{7.7a}\\ &= \arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} \exp \left(-y_i \left(f^{(b-1)}(\mathbf{x}_i) + \alpha \hat y(\mathbf{x}_i)\right)\right) \tag{7.7b}\\ &=\arg\min_{(\alpha, \hat y)} \sum_{i=1}^{n} \underbrace{\exp \left(-y_i f^{(b-1)}(\mathbf{x}_i)\right)}_{=\,w_i^{(b)}} \exp \left(-y_i \alpha \hat y(\mathbf{x}_i)\right) \tag{7.7c} \end{align*}$$

In the equation above:

  1. The first equality uses the definition of the exponential loss and the sequential structure of the boosted classifier (Equation 7.6)
  2. The second uses the property $\exp(a+b)=\exp(a)\exp(b)$ to factor out a quantity that can be interpreted as an individual weight for each data point in the training dataset, defined in Equation 7.8
  • Note that these weights are independent of $\alpha$ and $\hat y$ - that is, when learning $\hat y^{(b)}(\mathbf{x})$ and its coefficient $\alpha^{(b)}$ (by solving Equation 7.7c), we can regard the weights $w_i^{(b)}$ as constants.

$$w_i^{(b)} \overset{\text{def}}{=} \exp\left(-y_i f^{(b-1)}(\mathbf{x}_i)\right) \tag{7.8}$$

At this point, the book continues the AdaBoost training derivation - I don't think this is particularly relevant here.

  • We finally arrive at the following optimisation problem.

$$\hat y^{(b)}=\arg\min_{\hat y} \sum_{i=1}^{n} w_i^{(b)} \mathbb{I}(y_i \neq \hat y(\mathbf{x}_i)) \tag{7.9}$$

  • That is, the $b$th ensemble member should be trained by minimising the weighted misclassification loss, where each data point is assigned a weight $w_i^{(b)}$.
  • The weights force the model to focus on the mistakes (previous misclassifications) made by the ensemble of the first $b-1$ classifiers.

In practice the way that the optimisation problem is solved depends on the choice of base classifier.

  • Solving Equation 7.9 is almost identical to solving the standard classification problem, the only difference being that the training dataset is weighted.

  • For most base classifiers, training an ensemble member on a weighted classification problem is straightforward.

  • Since most classifiers are trained by minimising some cost function, the problem reduces to weighting the individual terms of the cost function and solving that modified problem.

    • That is, we use a weighted version of the loss function.
The AdaBoost training algorithm is given as:

Data: Training data $\mathcal{T}=\lbrace \mathbf{x}_i, y_i\rbrace_{i=1}^{n}$

Result: BB weak classifiers

  1. Assign weights $w_i^{(1)}=\frac{1}{n}$ to all data points
  2. For $b=1,\dots,B$ do
  3.  |    Train a weak classifier $\hat{y}^{(b)}(\mathbf{x})$ on the weighted training data $\lbrace(\mathbf{x}_i, y_i, w_i^{(b)})\rbrace_{i=1}^{n}$
  4.  |    Compute $E_\text{train}^{(b)}=\sum_{i=1}^{n} w_i^{(b)} \mathbb{I}\lbrace y_i\ne\hat{y}^{(b)}(\mathbf{x}_i)\rbrace$
  5.  |    Compute $\alpha^{(b)}=0.5\ln((1-E_\text{train}^{(b)})/E_\text{train}^{(b)})$
  6.  |    Compute $w_i^{(b+1)}=w_i^{(b)} \exp(-\alpha^{(b)} y_i \hat{y}^{(b)}(\mathbf{x}_i))$, $i=1,\dots,n$
  7.  |    Set $w_i^{(b+1)}\leftarrow w_i^{(b+1)} / \sum_{j=1}^{n} w_j^{(b+1)}$ (normalisation of weights)
  8. end

From lines 5 and 6 of the AdaBoost algorithm, we can draw the following conclusions:

  • AdaBoost trains the ensemble by minimising an exponential loss function of the boosted classifier at each iteration - the loss function is shown in Equation 7.5 below.

    $$L(y \cdot f(\mathbf{x})) = \exp(-y \cdot f(\mathbf{x})) \tag{7.5}$$

    • The fact that this is an exponential function makes the math work out nicely.
  • Part of the derivation shows that we can do the optimisation using the weighted misclassification loss at each iteration (Line 4).

The AdaBoost prediction algorithm is as follows:

Data: $B$ weak classifiers with confidence values $\lbrace\hat{y}^{(b)}(\mathbf{x}),\alpha^{(b)}\rbrace_{b=1}^{B}$ and a test input $\mathbf{x}_\star$

Result: Prediction $\hat{y}_\text{boost}^{(B)}(\mathbf{x}_\star)$

  1. Output $\hat{y}_\text{boost}^{(B)}(\mathbf{x}_\star)=\text{sign}\lbrace\sum_{b=1}^{B}\alpha^{(b)}\hat{y}^{(b)}(\mathbf{x}_\star)\rbrace$
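The training and prediction algorithms can be combined into a minimal pure-Python AdaBoost, using depth-1 decision stumps as the weak classifiers. All names (`fit_stump`, `fit_adaboost`) and the toy dataset are our own illustration, not the book's code:

```python
import math

def fit_stump(X, y, w):
    """Weighted decision stump on 1-D inputs: predict s if x > t, else -s."""
    best = None
    for t in X:
        for s in (-1, 1):
            pred = [s if x > t else -s for x in X]
            err = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)
            if best is None or err < best[0]:
                best = (err, t, s)
    err, t, s = best
    return (lambda x: s if x > t else -s), err

def fit_adaboost(X, y, B):
    """Lines 1-8 of the training algorithm: reweight, fit, score, repeat."""
    n = len(X)
    w = [1 / n] * n
    ensemble = []
    for _ in range(B):
        clf, err = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard the logarithm
        alpha = 0.5 * math.log((1 - err) / err)    # line 5
        w = [wi * math.exp(-alpha * yi * clf(xi))  # line 6
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]               # line 7: normalise
        ensemble.append((alpha, clf))
    def predict(x):                                # Eq. 7.4: weighted vote
        return 1 if sum(a * c(x) for a, c in ensemble) > 0 else -1
    return predict

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [-1, -1, 1, 1, 1, -1]    # not separable by any single stump
clf = fit_adaboost(X, y, B=10)
```

No single stump can classify this label pattern, but the weighted vote over several stumps with different thresholds can, which illustrates how boosting reduces the bias of a weak base model.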

Gradient Boosting

  • AdaBoost performs well if there is little noise in the data.
    • Performance deteriorates if there is a lot of noise in the data (because of the exponential loss function).
  • A more robust loss function can be used instead, at the cost of more computation.
  • Instead of saying that we learn a sequence of weak classifiers, where each classifier tries to correct the mistakes made by the previous ones, we can say that we construct an “additive model” of the form:

$$f^{(B)}(\mathbf{x}) = \sum_{b=1}^{B} \alpha^{(b)} f^{(b)}(\mathbf{x}) \tag{7.14}$$

  • In this equation, the $\alpha^{(b)}$ are real-valued coefficients and the $f^{(b)}(\mathbf{x})$ are basis functions.

    • For regression, $f^{(B)}(\mathbf{x})$ can be used directly as the model's prediction
    • For classification, it can be thresholded by a sign function, or transformed into class probabilities using a logistic function.
  • If the basis functions are fixed, then we only need to learn the $\alpha$ coefficients.

    • However, we obtain a more flexible model by allowing the basis functions themselves to be learned, just as in a neural network.
      • In fact, a two-layer regression neural network is an additive model.
  • There are two properties that distinguish boosting from other additive models

    • The basis functions are learned from data - each corresponds to a machine learning model itself, the base model of the boosting procedure
    • The basis functions and coefficients are learned sequentially - each iteration adds one component to the sum in Equation 7.14, and after $B$ iterations, the algorithm terminates.
  • The goal when training an additive model is to select $\lbrace\alpha^{(b)}, f^{(b)}(\mathbf{x})\rbrace_{b=1}^{B}$ such that the final model minimises the following cost function:

$$J(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i)) \tag{7.15}$$
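A minimal sketch of gradient boosting for regression with the squared-error loss: each basis function is a regression stump fitted to the current residuals, which are the negative gradients of the loss with respect to the current predictions. All names and data below are our own illustration, not the book's code:

```python
def fit_stump(X, r):
    """Least-squares regression stump: split at t, predict each side's mean."""
    best = None
    for t in X:
        left = [ri for xi, ri in zip(X, r) if xi <= t]
        right = [ri for xi, ri in zip(X, r) if xi > t]
        if not right:                 # degenerate threshold: no split, skip
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - (ml if xi <= t else mr)) ** 2
                  for xi, ri in zip(X, r))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_gradient_boost(X, y, B, lr=0.5):
    """Greedy additive model (Eq. 7.14): each new basis function is fitted
    to the current residuals and added with a fixed shrinkage factor lr."""
    f0 = sum(y) / len(y)              # constant initial model
    pred = [f0] * len(X)
    models = []
    for _ in range(B):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        models.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, X)]
    return lambda x: f0 + sum(lr * m(x) for m in models)

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]  # y = x^2 on a small grid
model = fit_gradient_boost(X, y, B=30)
```

The shrinkage factor `lr` plays the role of the coefficients $\alpha^{(b)}$, here fixed to a constant for simplicity rather than optimised at each iteration.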