# Ensemble Methods

last updated 23-Jun-2021

Some notes taken from Vadim Smolyakov, Ensemble Learning to Improve Machine Learning Results
https://blog.statsbot.co/ensemble-learning-d1dcd548e936

## Ensemble Methods

Construct a set of classifiers from the training data

Predict class label of test records by combining the predictions made by multiple classifiers

• sequential methods, where the base learners are generated sequentially (e.g. AdaBoost).
The basic motivation of sequential methods is to exploit the dependence between the base learners.
• parallel methods, where the base learners are generated in parallel (e.g. Random Forest).
The basic motivation of parallel methods is to exploit independence between the base learners, since the error can be reduced dramatically by averaging.

### Why they work

Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the errors made by the classifiers are uncorrelated.

The ensemble (majority-vote) classifier makes a wrong prediction only when more than half of the base classifiers are wrong:

$$P(\text{ensemble wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06,$$

which is much better than 0.35.
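This sum can be checked numerically; a minimal sketch using only the Python standard library:

```python
from math import comb

# Probability that a majority (13 or more) of 25 independent base
# classifiers, each with error rate eps = 0.35, are wrong at once.
n, eps = 25, 0.35
ensemble_error = sum(
    comb(n, i) * eps**i * (1 - eps) ** (n - i)
    for i in range(13, n + 1)
)
print(f"{ensemble_error:.3f}")  # ~0.06, much better than 0.35
```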

## General Approach

## Types of ensemble methods

Manipulate data distribution in choosing a training data set for the individual models

• Example: bagging, boosting

Manipulate input features -- good for data sets with highly redundant features

• Example: random forests

Manipulate class labels -- partition multiple class values into two sets

• Example: error-correcting output coding

Manipulate the learning algorithms

• Alter the topology of an ANN
• Randomly choose one of the top k splitting attributes (sketched below)
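As an illustration of the last bullet, here is a minimal sketch of choosing a split attribute at random from the top k candidates ranked by information gain; the helper names and the use of NumPy are assumptions for illustration, not from the original notes:

```python
import random
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(X, y, attr):
    """Information gain from splitting on a categorical attribute."""
    values, counts = np.unique(X[:, attr], return_counts=True)
    weighted = sum(
        (c / len(y)) * entropy(y[X[:, attr] == v])
        for v, c in zip(values, counts)
    )
    return entropy(y) - weighted

def random_top_k_attribute(X, y, k=3):
    """Rank attributes by information gain, then pick one of the top k at random."""
    ranked = sorted(range(X.shape[1]), key=lambda a: info_gain(X, y, a), reverse=True)
    return random.choice(ranked[:k])

# Example: 100 records, 6 categorical attributes, binary labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 6))
y = rng.integers(0, 2, size=100)
print(random_top_k_attribute(X, y, k=3))
```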

## Bagging

Bootstrap AGgregation

Sampling with replacement. Build a classifier on each bootstrap sample.

Each record has probability 1 − (1 − 1/n)^n ≈ 0.632 (for large n) of being included in a given bootstrap sample; equivalently, (1 − 1/n)^n ≈ 0.368 is the probability of being left out.
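A quick simulation illustrates this; a minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# A bootstrap sample: n draws with replacement from n records.
sample = rng.integers(0, n, size=n)
frac_selected = len(np.unique(sample)) / n
print(f"{frac_selected:.3f}")        # ~0.632 of records appear at least once
print(f"{1 - (1 - 1/n)**n:.3f}")     # 1 - (1 - 1/n)^n, also ~0.632
```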

### Example

Consider a 1-dimensional data set where the classifier is a decision stump.

Decision rule: x ≤ k versus x > k

Split point k is chosen based on entropy.

### Summary

Assume the test set is the same as the original data.

Use majority vote to determine the class assigned by the ensemble classifier.

Comparison of bagging for a decision tree and for K-nearest neighbors:

Accuracy: 0.63 (+/- 0.02) [Decision Tree]
Accuracy: 0.70 (+/- 0.02) [K-NN]
Accuracy: 0.64 (+/- 0.01) [Bagging Tree]
Accuracy: 0.59 (+/- 0.07) [Bagging K-NN]

The decision tree bagging ensemble achieved higher accuracy than the k-NN bagging ensemble.
k-NN classifiers are less sensitive to perturbations of the training samples and are therefore called stable learners; combining stable learners is less advantageous, since the ensemble does little to improve their generalization.
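The comparison above can be reproduced along these lines; a minimal sketch assuming scikit-learn and a synthetic dataset (the notes do not specify the dataset behind the reported numbers, so the printed accuracies will differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging builds each base learner on a different bootstrap sample.
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "K-NN": KNeighborsClassifier(),
    "Bagging Tree": BaggingClassifier(DecisionTreeClassifier(), n_estimators=10),
    "Bagging K-NN": BaggingClassifier(KNeighborsClassifier(), n_estimators=10),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [{name}]")
```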

## Boosting

An iterative procedure that adaptively changes the distribution of the training data (as determined by the sampling process), focusing more on previously misclassified records in each round.

Initially, all N records are assigned equal weights

Unlike bagging, weights may change at the end of each boosting round

• Records that are wrongly classified will have their weights increased, improving their likelihood of being selected in later rounds.
• Records that are classified correctly will have their weights decreased, reducing their influence.

In the running example, record 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds.
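AdaBoost is the standard instance of this reweighting scheme. A minimal sketch, assuming scikit-learn and an illustrative synthetic dataset (not from the original notes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_depth=1 gives a decision stump: a one-level tree of the form x <= k vs x > k.
stump = DecisionTreeClassifier(max_depth=1)
boosted = AdaBoostClassifier(stump, n_estimators=50)

scores = cross_val_score(boosted, X, y, cv=10, scoring="accuracy")
print(f"Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [AdaBoost]")
```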

### Example

Consider a 1-dimensional data set where the classifier is a decision stump.

Decision rule: x ≤ k versus x > k

Split point k is chosen based on entropy.

## Random Forests

Forests of randomized trees are a commonly used class of ensemble algorithms.

In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e. a bootstrap sample) from the training set. In addition, instead of using all the features, a random subset of features is selected, further randomizing the tree.

As a result, the bias of the forest increases slightly, but due to the averaging of less-correlated trees its variance decreases, resulting in an overall better model.

By contrast, an algorithm like J48 (C4.5) or Apriori is deterministic:

• The results are always the same with the same configuration parameters.
• You have to change the parameters to experiment or explore different results.
• You can explicitly remove attributes to not participate in the tree construction, which might give you different, even better, results.

The random forest algorithm:

• randomly choose a training set for this instance of a tree (bagging)
• at each node, randomly choose a subset of attributes from which to choose the "best" attribute for the split
• a set of trees is then generated; because of the random choices at each node, the trees differ from one another

The resulting set of decision trees exhibits low correlation.

The model is then applied to a test instance by taking a majority vote over the trees: each tree evaluates the instance, and the ensemble's prediction is the class chosen by the majority of the trees.
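Putting the pieces together, a minimal sketch using scikit-learn's RandomForestClassifier on an illustrative synthetic dataset (parameters are assumptions, not from the notes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each tree is grown on a bootstrap sample, and each split considers only a
# random subset of features (max_features); prediction is by majority vote.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")

scores = cross_val_score(forest, X, y, cv=10, scoring="accuracy")
print(f"Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [Random Forest]")
```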

## Empirical comparison among Ensemble Methods 