# Imbalanced Class Problem

last updated 23-Jun-2021

## Introduction

Many classification problems have skewed classes (significantly more records from one class than another). This is related to anomaly detection, which was mentioned earlier and is covered in detail later in the book.

• Credit card fraud
• Intrusion detection
• Defective products in manufacturing assembly line

Challenges

• Simple evaluation measures such as accuracy are not well-suited for imbalanced classes
• Detecting the rare class is like finding a needle in a haystack

Options for fixing the imbalance (sampling-based approaches)

• Undersampling of the majority class
• Oversampling of the minority class by creating artificial examples to equalize the balance
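Both options can be sketched with plain NumPy (the toy data and class sizes below are made up). Note that this sketch oversamples by duplicating minority rows; the "artificial examples" mentioned above refer to techniques such as SMOTE (available in the imbalanced-learn library), which interpolates new minority points between neighbors instead of duplicating rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy data: 990 majority ("NO") rows, 10 minority ("YES") rows
X_maj = rng.normal(0.0, 1.0, size=(990, 2))
X_min = rng.normal(3.0, 1.0, size=(10, 2))

# Undersampling: keep a random subset of the majority class
idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
X_maj_under = X_maj[idx]                    # 10 majority rows remain

# Oversampling: repeat minority rows (with replacement) to match the majority
idx = rng.choice(len(X_min), size=len(X_maj), replace=True)
X_min_over = X_min[idx]                     # 990 minority rows

print(X_maj_under.shape, X_min_over.shape)  # (10, 2) (990, 2)
```

Undersampling discards potentially useful majority examples, while naive oversampling can encourage overfitting to repeated minority rows; this trade-off is what motivates synthetic approaches like SMOTE.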

## Confusion Matrix

|          | Predicted +           | Predicted −            |
|----------|-----------------------|------------------------|
| Actual + | a: TP (true positive) | b: FN (false negative) |
| Actual − | c: FP (false positive)| d: TN (true negative)  |

Accuracy is the most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

## Problem with Accuracy

Consider a 2-class problem

• Number of Class NO examples = 990
• Number of Class YES examples = 10

If a model predicts everything to be class NO, its accuracy is 990/1000 = 99%

• This is misleading because the model does not detect any class YES example
• Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc)
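The 990/10 example can be checked directly; accuracy comes out high even though recall on the YES class is zero (a minimal sketch):

```python
# 990 NO (0) and 10 YES (1) examples; the model predicts NO for everything
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.99
print(recall)    # 0.0
```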

## Alternative Measures

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F = 2 / (1/r + 1/p) = 2pr / (p + r)

where p is precision and r is recall.

[from Wikipedia]
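A quick check of the formula: unlike the arithmetic mean, the harmonic mean is dragged toward the smaller of the two values, so a model cannot score well on F1 by being good at only one of precision or recall (the values below are chosen for illustration):

```python
def f1(p, r):
    # Harmonic mean of precision p and recall r: F = 2pr / (p + r)
    return 2 * p * r / (p + r)

print(f1(0.5, 0.5))           # 0.5 (equal p and r: F equals both)
print(round(f1(0.1, 0.9), 2)) # 0.18 (arithmetic mean would be 0.5)
```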

## Measures of Classification Performance

α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).

Confusion matrix — The confusion matrix gives a more complete picture when assessing the performance of a model. It is defined as follows:

|          | Predicted +                          | Predicted −                           |
|----------|--------------------------------------|---------------------------------------|
| Actual + | TP (True Positives)                  | FN (False Negatives, Type II error)   |
| Actual − | FP (False Positives, Type I error)   | TN (True Negatives)                   |

Main metrics — The following additional metrics are commonly used to assess the performance of classification models:

| Metric               | Interpretation                            |
|----------------------|-------------------------------------------|
| Accuracy             | Overall performance of model              |
| Precision            | How accurate the positive predictions are |
| Recall (sensitivity) | Coverage of actual positive sample        |
| Specificity          | Coverage of actual negative sample        |
| F1 score             | Hybrid metric useful for unbalanced classes |

ROC is a graphical approach for displaying the trade-off between detection rate and false alarm rate.

Developed in the 1950s in signal detection theory to analyze noisy signals.

ROC curve plots TPR against FPR

• Performance of a model represented as a point in an ROC curve
• Changing the threshold parameter of classifier changes the location of the point

(TPR,FPR):

• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal

Diagonal line represents random guessing

• Below the diagonal line, the prediction is opposite of the true class

To draw an ROC curve, the classifier must produce continuous-valued output

• Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record

Many classifiers produce only discrete outputs (i.e., predicted class)

• How to get continuous-valued outputs? Most classifiers can be adapted to emit a score or class probability: decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM

Example:

• 1-dimensional data set containing 2 classes (positive and negative)
• Any point located at x > t is classified as positive (threshold = t)
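For this 1-D example, each choice of t yields one (TPR, FPR) point, which is how changing the threshold moves the model along the ROC curve. A minimal sketch (the sample points are made up):

```python
# Made-up 1-D scores: positives tend to lie to the right of negatives
pos = [2.5, 3.1, 3.8, 4.0]   # actual + class
neg = [0.5, 1.0, 1.8, 2.8]   # actual - class

def roc_point(t):
    # Classify x > t as positive, then count TP/FN/FP/TN
    tp = sum(x > t for x in pos); fn = len(pos) - tp
    fp = sum(x > t for x in neg); tn = len(neg) - fp
    return tp / (tp + fn), fp / (fp + tn)   # (TPR, FPR)

print(roc_point(2.0))   # (1.0, 0.25): one negative (2.8) is past the threshold
print(roc_point(3.0))   # (0.75, 0.0): one positive (2.5) is missed
```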

ROC — The receiver operating characteristic (ROC) curve is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

| Metric                    | Formula        | Equivalent          |
|---------------------------|----------------|---------------------|
| True Positive Rate (TPR)  | TP / (TP + FN) | Recall, sensitivity |
| False Positive Rate (FPR) | FP / (TN + FP) | 1 - specificity     |

AUC — The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:
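Given a list of (FPR, TPR) points, AUC can be approximated with the trapezoidal rule; a minimal sketch (the example points are made up):

```python
def auc(points):
    # points: (FPR, TPR) pairs; trapezoidal area under the curve
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# The diagonal (random guessing) has area 0.5; the ideal curve has area 1.0
print(auc([(0, 0), (0.5, 0.5), (1, 1)]))   # 0.5
print(auc([(0, 0), (0, 1), (1, 1)]))       # 1.0
```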

## Using ROC for model comparison

In this example, no model consistently outperforms the other

• M1 is better for small FPR
• M2 is better for large FPR

Area under the ROC curve serves as the comparison measure

Ideal: Area = 1

Random guess: Area = 0.5

## Constructing an ROC curve

Use a classifier that produces a continuous-valued score for each instance

• The more likely it is for the instance to be in the + class, the higher the score

Sort the instances in decreasing order according to the score

Apply a threshold at each unique value of the score

Count the number of TP, FP, TN, FN at each threshold

• TPR = TP/(TP+FN)
• FPR = FP/(FP + TN)
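The steps above can be sketched directly; the labels and scores below are made up, and all scores are distinct (with tied scores, only one point per unique score value should be emitted, as noted above):

```python
def roc_curve(labels, scores):
    # Sort instances by score, descending (most likely positive first)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    P = sum(labels)               # total actual positives
    N = len(labels) - P           # total actual negatives
    tp = fp = 0
    points = [(0.0, 0.0)]         # threshold above the highest score
    for i in order:
        # Lower the threshold just past this instance's score
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))   # (FPR, TPR)
    return points

labels = [1, 1, 0, 1, 0]               # 1 = positive class
scores = [0.9, 0.8, 0.7, 0.6, 0.3]     # classifier's continuous-valued output
print(roc_curve(labels, scores))       # starts at (0.0, 0.0), ends at (1.0, 1.0)
```

Each correctly-ranked positive moves the curve up (TPR grows); each negative encountered in the ranking moves it right (FPR grows), which is why a good ranking hugs the top-left corner.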