DM 352 Syllabus | DM 552 Syllabus

last updated 23-Jun-2021

Many classification problems have skewed classes (significantly more records from one class than another). This connects with anomaly detection, which we covered earlier but which comes later in the book.

- Credit card fraud
- Intrusion detection
- Defective products in manufacturing assembly line

Challenges

- Simple evaluation measures such as accuracy are not well suited for imbalanced classes
- Detecting the rare class is like finding needle in a haystack

Fixing the imbalance (**sampling-based** approaches)

- Undersampling of the majority class
- Oversampling of the minority class by creating artificial examples to equalize the balance
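A minimal sketch of both sampling-based fixes using plain NumPy (the function names and toy data are illustrative; in practice a library such as imbalanced-learn provides these, including SMOTE-style synthetic oversampling):

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample_majority(X, y, majority=0):
    """Randomly drop majority-class records until the classes are balanced."""
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y != majority)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

def oversample_minority(X, y, minority=1):
    """Resample minority-class records (with replacement) until balanced."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    extra = rng.choice(min_idx, size=len(maj_idx), replace=True)
    idx = np.concatenate([maj_idx, extra])
    return X[idx], y[idx]

# Toy data: 990 majority (class 0) records, 10 minority (class 1) records
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 990 + [1] * 10)

Xu, yu = undersample_majority(X, y)
Xo, yo = oversample_minority(X, y)
print((yu == 0).sum(), (yu == 1).sum())  # 10 10
print((yo == 0).sum(), (yo == 1).sum())  # 990 990
```

Note that simple oversampling duplicates records, which can encourage overfitting; synthetic-example methods interpolate new minority points instead.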

Confusion matrix cell counts:

- a: TP (true positives)
- b: FN (false negatives)
- c: FP (false positives)
- d: TN (true negatives)

Accuracy is the most widely used metric:

- Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + FN + FP + TN)

Consider a 2-class problem

- Number of Class NO examples = 990
- Number of Class YES examples = 10

If a model predicts everything to be class NO, accuracy is 990/1000 = 99%

- This is misleading because the model does not detect any class YES example
- Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc)
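The accuracy paradox above can be verified directly (plain Python, toy data as in the example):

```python
# All-NO classifier on 990 NO / 10 YES records: high accuracy, zero recall.
y_true = [0] * 990 + [1] * 10   # 0 = NO, 1 = YES
y_pred = [0] * 1000             # predict everything as NO

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = tp / sum(y_true)       # fraction of actual YES records detected

print(accuracy)  # 0.99
print(recall)    # 0.0
```

99% accuracy, yet not a single rare-class record is detected.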

The traditional F-measure or balanced F-score (**F1 score**) is the harmonic mean of precision and recall:

- F1 = 2 / (1/r + 1/p) = 2rp / (r + p)
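A quick numeric check of the harmonic-mean formula from confusion-matrix counts (the counts are illustrative):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision p and recall r = 2rp / (r + p)."""
    p = tp / (tp + fp)  # precision
    r = tp / (tp + fn)  # recall
    return 2 * r * p / (r + p)

# e.g. TP = 8, FP = 2, FN = 2  ->  p = 0.8, r = 0.8
print(round(f1_score(8, 2, 2), 6))  # 0.8
```

Because it is a harmonic mean, F1 is close to the smaller of precision and recall, so it punishes a model that sacrifices one for the other.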

[from Wikipedia]

**α** is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

**β** is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).

**Confusion matrix** — The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:

|  | Predicted + | Predicted − |
| --- | --- | --- |
| **Actual +** | TP (True Positives) | FN (False Negatives, Type II error) |
| **Actual −** | FP (False Positives, Type I error) | TN (True Negatives) |

**Main metrics** — The following additional metrics are commonly used to assess the performance of classification models:

| Metric | Interpretation |
| --- | --- |
| Accuracy | Overall performance of model |
| Precision | How accurate the positive predictions are |
| Recall (sensitivity) | Coverage of actual positive samples |
| Specificity | Coverage of actual negative samples |
| F1 score | Hybrid metric useful for unbalanced classes |
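All of these metrics fall out of the four confusion-matrix counts; a compact sketch (the example counts are made up):

```python
def metrics(tp, fn, fp, tn):
    """Main classification metrics computed from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / total,
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),            # sensitivity, TPR
        "specificity": tn / (tn + fp),
        "f1":          2 * tp / (2 * tp + fp + fn),  # same as 2rp / (r + p)
    }

# Imbalanced example: 10 actual positives, 990 actual negatives
m = metrics(tp=8, fn=2, fp=90, tn=900)
print(m["accuracy"])   # 0.908
print(m["recall"])     # 0.8
```

Note how precision (8/98 ≈ 0.082) exposes the flood of false positives that accuracy alone hides.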

ROC is a graphical approach for displaying the trade-off between detection rate and false alarm rate.

Developed in the 1950s in signal detection theory to analyze noisy signals.

ROC curve plots TPR against FPR

- Performance of a model is represented as a point on an ROC curve
- Changing the threshold parameter of the classifier changes the location of the point

Notable points (TPR, FPR):

- (0,0): declare everything to be negative class
- (1,1): declare everything to be positive class
- (1,0): ideal

Diagonal line represents random guessing

- Below the diagonal line, predictions are the opposite of the true class

To draw ROC curve, classifier must produce continuous-valued output

- Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record

Many classifiers produce only discrete outputs (i.e., predicted class)

- How to get continuous-valued outputs? Many classifiers can be adapted to emit a score (e.g., a class probability or margin):
- Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM

- Example: a 1-dimensional data set containing 2 classes (positive and negative)

- Any point located at x > t is classified as positive (threshold = t)

**ROC** — The receiver operating characteristic (ROC) curve is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

| Metric | Formula | Equivalent |
| --- | --- | --- |
| True Positive Rate (TPR) | TP / (TP + FN) | Recall, sensitivity |
| False Positive Rate (FPR) | FP / (FP + TN) | 1 − specificity |

**AUC** — The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:

In this example, no model consistently outperforms the other:

- M1 is better for small FPR
- M2 is better for large FPR

Area under the ROC curve serves as the comparison measure

Ideal: Area = 1

Random guess: Area = 0.5

Use a classifier that produces a continuous-valued score for each instance

- The more likely it is for the instance to be in the + class, the higher the score

Sort the instances in decreasing order according to the score

Apply a threshold at each unique value of the score

Count the number of TP, FP, TN, FN at each threshold

- TPR = TP/(TP+FN)
- FPR = FP/(FP + TN)
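The threshold-sweep procedure above can be sketched in a few lines (the scores and labels are hypothetical; scores are assumed distinct, matching the "threshold at each unique value" step):

```python
def roc_points(scores, labels):
    """Sweep a threshold over decreasing scores; return (FPR, TPR) points."""
    P = sum(labels)            # actual positives
    N = len(labels) - P        # actual negatives
    # Sort instances by decreasing score: most likely positive first.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pts = [(0.0, 0.0)]         # threshold above max score: all predicted negative
    tp = fp = 0
    for i in order:
        if labels[i] == 1:
            tp += 1            # instance now predicted positive: a new TP
        else:
            fp += 1            # ... or a new FP
        pts.append((fp / N, tp / P))
    return pts

def auc(pts):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Hypothetical ranked output: higher score = more likely positive
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505]
labels = [1,   1,   0,   1,   1,    1,    0,    0,    1,    0]
pts = roc_points(scores, labels)
print(round(auc(pts), 6))  # 0.75
```

Each instance processed moves the curve up (a TP) or right (an FP), which is why a perfect ranking traces straight up to (FPR, TPR) = (0, 1) before moving right, giving AUC = 1.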