# Overfitting

DS 352 Syllabus

last updated 08-Sep-2020

## Kinds of errors

Training errors (apparent errors)--Errors committed on the training set

Test errors -- Errors committed on the test set

Generalization errors -- Expected error of a model over random selection of records from same distribution

### Example: two class problem: • + : 5200 instances
5000 instances generated from a Gaussian distribution, centered at (10,10)
• o : 5200 instances
Generated from a uniform distribution

10 % of the data used for training and 90% of the data used for testing

## Decision tree with 5 nodes ### and with 50 nodes Comparison between. Which is better? ## Model Overfitting

Underfitting: when model is too simple, both training and test errors are large

Overfitting: when model is too complex, training error is small but test error is large
i.e. the model fits the training well but fails in general. If training data is under-representative, testing errors increase and training errors decrease on increasing number of nodes

Increasing the size of training data reduces the difference between training and testing errors at a given number of nodes ## Reasons for overfitting

Limited Training Size

High Complexity of the Model

• Multiple Comparison Problem

Many algorithms employ a greedy strategy of trying different altermative models and tracking the best model

• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting

### Multiple Comparison Problem

Consider the task of predicting whether stock market will rise/fall in the next 10 trading days

Random guessing: P(correct) = 0.5

Make 10 random guesses in a row: Showing that you likely will find a great fit, use the following procedure

• Get 50 analysts
• Each analyst makes 10 random guesses
• Choose the analyst that makes the most number of correct predictions

Probability that at least one analyst makes at least 8 correct predictions: P(#correct≥8) = 1 - (1 - 0.0547)50 = 0.9333

### Effect of Multiple Comparisons.

Below on the left is fairly simple data set that should have a simple decision tree with just x and y attributes.

But if there are lots of noisy variables added, we can see fitting improves as more variables are added but the testing shows no improvement as shown in the upper left graph below.

### Notes on overfitting

Overfitting results in decision trees that are more complex than necessary.

Training error does not provide a good estimate of how well the tree will perform on previously test or future records.

Need ways for estimating generalization errors.

Model selection methodology can be used for estimating general errors.

## Model Selection

Performed during model building

Purpose is to ensure that model is not overly complex (to avoid overfitting)

Need to estimate generalization error:

• Using Validation Set
• Incorporating Model Complexity
• Estimating Statistical Bounds (skipping)

Approach is to split the training set into two partitions:

• Training set, used as before
• Validation set, used to generate the generalization error
• Caution: less observations to do the training

### Consider the complexity of the model

Rational: Occam's Razor: mathematics principle of given two models of similar generalization errors, one should prefer the simpler model over the more complex model

A complex model has a greater chance of being fitted accidentally by errors in data

Therefore, one should include model complexity when evaluating a model

Gen.Error(Model) = Train.Error(Model, Train.Data) + α * Complexity(Model)

Pessimistic Error Estimate of decision tree T with k leaf nodes: Where:

• err(T): error rate on all training records
• Ω: trade-off hyper-parameter (similar to α), representing the relative cost of adding a leaf node
• k: number of leaf nodes
• Ntrain: total number of training records

Example: Left tree has 4 wrong classifications, but with 7 nodes --> err()= 0.458

Right tree has 6 wrong classifications, with only 4 nodes --> err() = 0.417

So the right tree would be preferred.

There are other, similar, model selection approaches that we won't consider:

• Minimum Description Length (MDL)
• Statistical Bounds

## Model Selection for Decision Trees

Pre-Pruning (Early Stopping Rule)

Stop the algorithm before it becomes a fully-grown tree

Typical stopping conditions for a node:

• Stop if all instances belong to the same class
• Stop if all the attribute values are the same

More restrictive conditions:

• Stop if number of instances is less than some user-specified threshold
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
• Stop if estimated generalization error falls below certain threshold

Post-pruning

Grow decision tree to its entirety, then apply the following:

Subtree replacement

• Trim the nodes of the decision tree in a bottom-up fashion
• If generalization error improves after trimming, replace sub-tree by a leaf node
• Class label of leaf node is determined by the largest class of instances in the sub-tree

Subtree raising

• Replace subtree with most frequently used branch

Example, with Ω = 0.5  ## Model Evaluation

Purpose: To estimate performance of classifier on previously unseen data (test set)

Holdout

• Reserve k% for training and (100-k)% for testing
• Random subsampling: repeated holdout

Cross validation

• Partition data into k disjoint subsets
• k-fold: train on k-1 partitions, test on the remaining one
• Leave-one-out: k=n

## Cross Validation

Divide the data set into three or more even sized partitions (k partitions)

Cycle through each partition as the test set with the other k-1 partitions used as training.

The average error is used as the generalization error computation for the model

k=3 is shown below. ## Handling interactions ## Limitations of single attribute-based decision boundaries 