Imbalanced Class Problem


last updated 23-Jun-2021

Introduction

Many classification problems have skewed classes: significantly more records belong to one class than to the other. This connects to anomaly detection, which we covered earlier even though it appears later in the book.

Challenges

Fixing the imbalance with a sampling-based approach: oversample the minority class, undersample the majority class, or both (a rough sketch follows).
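As a rough illustration only (not from the course materials), the sketch below balances a data set by random oversampling with NumPy; the array names X and y and the assumption of exactly two classes are mine.

import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate randomly chosen minority-class rows until the two classes are balanced."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]        # label with the fewest records
    n_needed = counts.max() - counts.min()       # how many duplicates to add

    minority_idx = np.where(y == minority)[0]
    extra = rng.choice(minority_idx, size=n_needed, replace=True)

    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])
    return X_bal, y_bal

In practice, libraries such as imbalanced-learn provide ready-made over- and under-samplers (and SMOTE) for the same purpose.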

 

Confusion Matrix

                Predicted +                 Predicted -
Actual +        a: TP (true positive)       b: FN (false negative)
Actual -        c: FP (false positive)      d: TN (true negative)

Accuracy is the most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Problem with Accuracy

Consider a 2-class problem with 990 records of class NO and 10 records of class YES.

If a model predicts everything to be class NO, its accuracy is 990/1000 = 99%, even though it never detects a single YES record.
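A tiny sketch of that failure mode (the label array below is made up to match the 990/10 split):

import numpy as np

y_true = np.array([0] * 990 + [1] * 10)   # 990 class-NO records, 10 class-YES records
y_pred = np.zeros_like(y_true)            # a "model" that always predicts NO

accuracy = (y_pred == y_true).mean()      # 0.99
recall = y_pred[y_true == 1].mean()       # 0.0 -- not a single YES record is detected
print(f"accuracy = {accuracy:.2f}, recall on YES = {recall:.2f}")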


Alternative Measures

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F1 = 2 / (1/r + 1/p) = 2pr / (p + r),  where p = precision and r = recall

[from Wikipedia]
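As a quick numeric check of the formula (the counts below are made up):

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall, written in terms of the counts
    p = tp / (tp + fp)          # precision
    r = tp / (tp + fn)          # recall
    return 2 * p * r / (p + r)

print(f1_score(tp=8, fp=2, fn=12))   # p = 0.8, r = 0.4, F1 ≈ 0.53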

 

 

Examples

 

Measures of Classification Performance

α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).

 

Confusion matrix  The confusion matrix is used to get a more complete picture when assessing the performance of a model. It is defined as follows:

                Predicted +                              Predicted -
Actual +        TP (True Positives)                      FN (False Negatives, Type II error)
Actual -        FP (False Positives, Type I error)       TN (True Negatives)

Main metrics  The following additional metrics are commonly used to assess the performance of classification models:

 

 

Metric                  Interpretation
Accuracy                Overall performance of the model
Precision               How accurate the positive predictions are
Recall (sensitivity)    Coverage of the actual positive samples
Specificity             Coverage of the actual negative samples
F1 score                Hybrid metric, useful for unbalanced classes
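A minimal sketch (the function name and example counts are mine) computing these metrics directly from the four confusion-matrix counts:

def classification_metrics(tp, fn, fp, tn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)   # overall performance
    precision   = tp / (tp + fp)                    # accuracy of positive predictions
    recall      = tp / (tp + fn)                    # sensitivity / TPR
    specificity = tn / (tn + fp)                    # coverage of actual negatives
    f1          = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

print(classification_metrics(tp=40, fn=10, fp=25, tn=925))

In practice, scikit-learn's sklearn.metrics module (confusion_matrix, classification_report) computes the same quantities from predicted and true labels.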

 

 


 

 

 


ROC (Receiver Operating Characteristic)

ROC is a graphical approach for displaying the trade-off between detection rate (TPR) and false alarm rate (FPR).

Developed in the 1950s in signal detection theory to analyze noisy signals.

The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis).

Characteristic points in (TPR, FPR) coordinates: (0,0) predicts everything negative, (1,1) predicts everything positive, and (1,0) is the ideal classifier.

Diagonal line represents random guessing

To draw an ROC curve, the classifier must produce a continuous-valued output (e.g., a score or estimated class probability).

Many classifiers produce only discrete outputs (i.e., the predicted class).

 

- Example: a 1-dimensional data set containing 2 classes (positive and negative)

- Any point with x > t is classified as positive (threshold = t); a sketch of counting TP/FP/FN/TN at a single threshold follows
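As an illustrative sketch (the data and threshold are made up), counting the confusion-matrix entries for one fixed threshold t on 1-D scores:

import numpy as np

x = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55])   # 1-D scores
y = np.array([0,   0,   1,    1,   1,   0,   1,   0])      # true labels (1 = positive)

t = 0.5                          # threshold: predict positive when x > t
pred = (x > t).astype(int)

tp = int(np.sum((pred == 1) & (y == 1)))
fp = int(np.sum((pred == 1) & (y == 0)))
fn = int(np.sum((pred == 0) & (y == 1)))
tn = int(np.sum((pred == 0) & (y == 0)))
print(tp, fp, fn, tn)            # moving t trades FP against FN, tracing out the ROC curve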

 

ROC  The receiver operating characteristic (ROC) curve is the plot of TPR versus FPR obtained by varying the decision threshold. These metrics are summed up in the table below:

Metric                       Formula             Equivalent
True Positive Rate (TPR)     TP / (TP + FN)      Recall, sensitivity
False Positive Rate (FPR)    FP / (TN + FP)      1 - specificity

AUC  The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC curve.
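A minimal sketch (the helper name and inputs are mine) computing AUC from a list of ROC points with the trapezoidal rule; it assumes the FPR values are sorted in increasing order:

import numpy as np

def auc_from_roc(fpr, tpr):
    # Trapezoidal rule: step width along the FPR axis times average TPR on each step
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    widths = np.diff(fpr)
    heights = (tpr[1:] + tpr[:-1]) / 2.0
    return float(np.sum(widths * heights))

print(auc_from_roc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))   # ideal classifier: 1.0
print(auc_from_roc([0.0, 1.0], [0.0, 1.0]))             # random guessing (diagonal): 0.5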



Using ROC for model comparison

In this example, where the two ROC curves cross, no model consistently outperforms the other: each is better over some range of the false positive rate.

Area under the ROC curve serves as the comparison measure

Ideal: Area = 1

Random guess: Area = 0.5

 

 

 


Constructing a ROC curve

Use a classifier that produces a continuous-valued score for each instance

Sort the instances in decreasing order according to the score

Apply a threshold at each unique value of the score

Count the number of TP, FP, TN, FN at each threshold, then compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN); plotting the resulting (FPR, TPR) pairs gives the curve (a sketch follows).
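Putting those steps together, a minimal sketch (function and variable names are mine) that sweeps the threshold over the sorted scores:

import numpy as np

def roc_points(scores, labels):
    # scores: continuous classifier outputs, higher means "more positive"
    # labels: 1 for actual positive, 0 for actual negative
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    order = np.argsort(-scores)              # sort instances by decreasing score
    scores, labels = scores[order], labels[order]

    P = labels.sum()                         # total actual positives
    N = len(labels) - P                      # total actual negatives

    fpr, tpr = [0.0], [0.0]                  # threshold above the max score: everything predicted negative
    tp = fp = 0
    for i, y in enumerate(labels):
        tp += y                              # this instance is now predicted positive
        fp += 1 - y
        # record a point only when the score changes (one threshold per unique value)
        if i == len(labels) - 1 or scores[i] != scores[i + 1]:
            tpr.append(tp / P)
            fpr.append(fp / N)
    return fpr, tpr

# Made-up scores and labels for illustration
fpr, tpr = roc_points([0.95, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
print(list(zip(fpr, tpr)))                   # staircase from (0, 0) up to (1, 1)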