Anomaly/Outlier Detection


last updated 23-Jun-2021

Introduction

These topics have been referenced in other settings. Much of this may be review, but we can use some of the algorithms covered in the course for these purposes.

What are anomalies/outliers?

Natural implication is that anomalies are relatively rare

Can be important or a nuisance

Examples of anomaly detection

  1. Fraud Detection: odd credit card charges
  2. Intrusion Detection: a quick sequence of authorization failures
  3. Ecosystem Disturbances: floods, droughts, heat waves
  4. Medicine and public health: influenza outbreaks
  5. Aviation Safety: abnormal pilot behavior or aircraft sequence of events

 

Ozone Depletion History Example

In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels.

Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

The ozone concentrations recorded by the satellite were so low they were being treated as outliers by the software and discarded!

[Sources:
    http://exploringdata.cqu.edu.au/ozone.html 
    http://www.epa.gov/ozone/science/hole/size.html]

 


Causes of Anomalies

Data from different classes: Measuring the weights of oranges, but a few grapefruit are mixed in

Natural variation: Unusually tall people

Data errors: 200 pound 2 year old

Noise versus Anomalies

Noise is erroneous, perhaps random, values or contaminating objects

Noise doesn’t necessarily produce unusual values or objects. Such noise may be harder to detect, but it is noise nonetheless.

Noise is not interesting, unless it can be used to rate the quality/accuracy of the instrument generating the data.

Anomalies may be interesting, provided they are not a result of noise.

While noise and anomalies are related, they are distinct concepts.

 


General Issues

Number of attributes:

Many anomalies are defined in terms of a single attribute--it's easy to visualize and automate

However, an object may not be anomalous in any one attribute

Can be hard to find an anomaly using all attributes

Pairings of attributes

 

Anomaly Scoring

Many anomaly detection techniques provide only a binary categorization

Other approaches assign a score to all points

In the end, you often still need a binary decision

How many anomalies are there?

Other issues for anomaly detection

Find all anomalies at once or one at a time

Evaluation

Efficiency

Context -- what do you as a data scientist bring to the table?


Variants of Anomaly Detection Problems

Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t

Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores

Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
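The three problem variants above can be sketched in a few lines. This is only an illustration: the function names, the example scores, and the distance-to-median score used for the third variant are assumptions made for the example, not part of the course material.

```python
import heapq
import statistics

def above_threshold(scores, t):
    """Variant 1: every point whose anomaly score is greater than t."""
    return [x for x, s in scores.items() if s > t]

def top_n(scores, n):
    """Variant 2: the n points with the largest anomaly scores."""
    return heapq.nlargest(n, scores, key=scores.get)

def score_against(D, x):
    """Variant 3: score a test point x with respect to a data set D.

    Assumed score for illustration only: distance from x to the median of D.
    """
    return abs(x - statistics.median(D))

# Hypothetical anomaly scores for four labeled points.
scores = {"a": 0.1, "b": 0.9, "c": 0.4, "d": 0.7}

flagged = above_threshold(scores, 0.5)            # points scoring above 0.5
worst_two = top_n(scores, 2)                      # two highest-scoring points
test_score = score_against([1, 2, 3, 4, 100], 50)
```

The first two variants differ only in how the binary decision is made: a fixed threshold t leaves the number of anomalies open, while top-n fixes the number in advance.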

 


Model-based Anomaly Detection

Build a model for the data and review the effects

Unsupervised

Supervised

Proximity-based

Density-based

Pattern matching


Visual approaches

Boxplots for a single attribute, or scatterplots for pairs of attributes. Both apply to quantitative attributes.

Limitations of the visual approach: it is subjective and not automated.

Boxplot: outliers are points beyond the whiskers.

Scatterplot: outliers are points that do not fit the visual pattern.
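The boxplot rule can be automated for a single attribute. The sketch below assumes the common convention that whiskers extend 1.5 times the interquartile range beyond the quartiles; that multiplier, the function name, and the sample values are assumptions, and plotting libraries may compute quartiles slightly differently.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return the values that fall beyond the boxplot whiskers.

    Assumes whiskers at Q1 - whisker*IQR and Q3 + whisker*IQR.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one extreme value.
values = [2, 3, 3, 4, 4, 5, 5, 6, 30]
outliers = boxplot_outliers(values)
```

This removes the subjectivity for one attribute, but the choice of whisker length is still a judgment call.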

 


Statistical Approaches

Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.

Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
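As a rough sketch of the probabilistic definition, the example below fits a normal distribution to a sample and reports how probable it is to see a value at least as far from the mean as a given point. The function name, the sample values, and the use of a two-sided tail probability are assumptions made for illustration.

```python
from statistics import NormalDist

def tail_probability(x, model):
    """Two-sided probability, under the fitted normal model, of a value
    at least as far from the mean as x."""
    z = abs(x - model.mean) / model.stdev
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical sample assumed to contain only normal observations.
samples = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
model = NormalDist.from_samples(samples)

p_typical = tail_probability(5.1, model)   # close to the mean: high probability
p_extreme = tail_probability(6.0, model)   # far from the mean: very low probability
```

A point would be called an outlier when this probability falls below a chosen cutoff. Note that the model here is fitted on clean data; if outliers are in the sample, they inflate the estimated standard deviation.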

Applying a statistical test depends on

Issues

Normal Distributions


Grubbs’ Test for Univariate Data

Detects outliers in univariate data (algorithm is detailed in Exercise 9.7 at the end of the chapter)

Assumes data comes from normal distribution

Detects one outlier at a time using z-scores, removes the outlier, and repeats

H0: There is no outlier in data

HA: There is at least one outlier

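Grubbs’ test statistic is the largest absolute z-score, $G = \max_i |X_i - \bar{X}| / s$, where $\bar{X}$ is the sample mean and $s$ is the sample standard deviation.

Reject H0 if $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$, where N is the number of values and $t_{\alpha/(2N),\,N-2}$ is the critical value of the t-distribution with N − 2 degrees of freedom. This threshold assumes normally distributed data and is set by the chosen significance level α.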

Repeat until no more outliers are detected.

Need to recalculate mean and standard deviation after removing the outlier.
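The detect, remove, and repeat loop might look like the following sketch. It is not the textbook's algorithm verbatim. The critical value is supplied by the caller as a function of the current sample size, for example looked up in a published Grubbs table. The constant used in the example is a placeholder chosen only to make the sketch runnable, and the function names and data are hypothetical.

```python
import statistics

def grubbs_outliers(values, critical_value):
    """Repeatedly remove the value with the largest absolute z-score while
    that z-score exceeds critical_value(n), where n is the current sample size.

    critical_value is an assumed caller-supplied function; this sketch does
    not compute the t-distribution based threshold itself.
    """
    remaining = list(values)
    outliers = []
    while len(remaining) > 2:
        # Mean and standard deviation are recomputed after every removal.
        mean = statistics.fmean(remaining)
        s = statistics.stdev(remaining)
        if s == 0:
            break  # all values identical, nothing to test
        suspect = max(remaining, key=lambda v: abs(v - mean))
        g = abs(suspect - mean) / s
        if g <= critical_value(len(remaining)):
            break  # fail to reject H0: no outlier detected
        remaining.remove(suspect)
        outliers.append(suspect)
    return outliers, remaining

def placeholder_critical_value(n):
    """Placeholder only, NOT a real Grubbs critical value."""
    return 2.0

# Hypothetical measurements with one suspicious value.
weights = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 25.0]
outliers, cleaned = grubbs_outliers(weights, placeholder_critical_value)
```

In this example the first pass removes 25.0; on the second pass the largest z-score among the remaining six values is below the placeholder threshold, so the loop stops.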

Strengths/Weaknesses of Statistical Approaches

Firm mathematical foundation

Can be very efficient

Good results if distribution is known, but in many cases, data distribution may not be known

For high dimensional data, it may be difficult to estimate the true distribution

Anomalies can distort the parameters of the distribution


Distance-Based Approaches

Several different techniques.

An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)
Some statistical definitions are special cases of this.
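A brute-force sketch of this definition follows. The function name, the parameter names, the sample points, and the reading of "a specified fraction" as "at least that fraction of the other objects" are assumptions made for the example.

```python
import math

def is_distance_based_outlier(i, data, fraction, distance):
    """True if at least `fraction` of the other objects lie more than
    `distance` away from data[i]."""
    others = [q for j, q in enumerate(data) if j != i]
    far = sum(1 for q in others if math.dist(data[i], q) > distance)
    return far / len(others) >= fraction

# Hypothetical 2-D points: a tight group plus one distant point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = [p for i, p in enumerate(data)
            if is_distance_based_outlier(i, data, fraction=0.9, distance=3)]
```

Both parameters must be chosen by the analyst, and each point is compared against every other point.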

The outlier score of an object is the distance to its kth nearest neighbor.

One nearest neighbor - one outlier

One nearest neighbor - two outliers

Five nearest neighbors - small cluster

Five nearest neighbors - differing density
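The kth-nearest-neighbor score, and its sensitivity to k, can be sketched as follows. The function name and the sample points are assumptions. The data is built so that two outliers sit close to each other, which is the situation in which one nearest neighbor fails.

```python
import math

def knn_scores(data, k):
    """Outlier score of each point: distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(data):
        dists = sorted(math.dist(p, q) for j, q in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: a square cluster and two outliers close to each other.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5.2, 5)]

scores_k1 = knn_scores(data, 1)   # the two outliers look normal: each has a near twin
scores_k2 = knn_scores(data, 2)   # with k=2 both outliers get the largest scores
```

With k = 1 the two outliers score lower than the cluster points because each is the other's nearest neighbor. Raising k to 2 exposes them, which shows why the result is sensitive to the choice of k.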

 

Strengths/Weaknesses of Distance-Based Approaches

Simple

Expensive – O(n²)

Sensitive to parameters

Sensitive to variations in density

Distance becomes less meaningful in high-dimensional space


Density-Based Approaches

Density-based Outlier: The outlier score of an object is the inverse of the density around the object.

If there are regions of different density, this approach can have problems

Relative Density

Consider the density of a point relative to that of its k nearest neighbors

Local outlier factor (LOF)

For each point, compute the density of its local neighborhood

Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of each of p's nearest neighbors to the density of p

Outliers are the points with the largest LOF values

In the Nearest Neighbor approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
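The relative-density idea can be sketched as below. This is a simplified stand-in, not the full LOF algorithm, which defines density through reachability distances. Here density is assumed to be the inverse of the average distance to the k nearest neighbors, and the function names and sample points are hypothetical.

```python
import math

def k_nearest(i, data, k):
    """Indices of the k nearest neighbors of data[i]."""
    order = sorted((j for j in range(len(data)) if j != i),
                   key=lambda j: math.dist(data[i], data[j]))
    return order[:k]

def density(i, data, k):
    """Assumed density: inverse of the mean distance to the k nearest
    neighbors (requires that no point coincides with all its neighbors)."""
    nbrs = k_nearest(i, data, k)
    return k / sum(math.dist(data[i], data[j]) for j in nbrs)

def relative_density_scores(data, k):
    """Average ratio of each neighbor's density to the point's own density.
    Scores near 1 are typical; large scores suggest outliers."""
    dens = [density(i, data, k) for i in range(len(data))]
    scores = []
    for i in range(len(data)):
        nbrs = k_nearest(i, data, k)
        scores.append(sum(dens[j] / dens[i] for j in nbrs) / k)
    return scores

# Hypothetical data: a dense cluster, a point just outside it (index 4),
# and a sparse cluster.
data = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
        (0.6, 0.6),
        (10, 10), (10, 12), (12, 10), (12, 12)]
scores = relative_density_scores(data, k=2)
```

The point at index 4 is closer to its neighbors than the sparse-cluster points are to theirs, so a plain nearest-neighbor distance would not rank it first. Relative to the dense cluster next to it, however, it is isolated, and it receives by far the largest score.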

Strengths/Weaknesses of Density-Based Approaches

Simple

Expensive – O(n²)

Sensitive to parameters

Density becomes less meaningful in high-dimensional space


Clustering-Based Approaches

Clustering-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster

Other issues include the impact of outliers on the clusters and the number of clusters

 

 

Distance of Points from Closest Centroids

Relative Distance of Points from Closest Centroid
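Both scores can be sketched once centroids are available. The example assumes the centroids have already been produced by some clustering step (the values below are made up), and it assumes relative distance means a point's distance to its closest centroid divided by the median distance of that cluster's points to the same centroid. The function name and data are hypothetical.

```python
import math
import statistics

def centroid_distance_scores(data, centroids):
    """Return (distance, relative distance) to the closest centroid per point.

    Relative distance divides by the median distance within the assigned
    cluster, which is assumed to be nonzero.
    """
    assigned = [min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
                for p in data]
    dist = [math.dist(p, centroids[c]) for p, c in zip(data, assigned)]
    medians = {c: statistics.median(d for d, a in zip(dist, assigned) if a == c)
               for c in set(assigned)}
    relative = [d / medians[c] for d, c in zip(dist, assigned)]
    return dist, relative

# Hypothetical data: a compact cluster with one stray point (index 4)
# and a diffuse cluster.
data = [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1), (0, 1.0),
        (12, 10), (8, 10), (10, 12), (10, 8)]
centroids = [(0, 0), (10, 10)]   # assumed output of a clustering step

dist, relative = centroid_distance_scores(data, centroids)
```

By absolute distance the diffuse cluster's points score higher than the stray point at index 4. By relative distance the stray point stands out, because it is far away compared with the other members of its compact cluster.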

Strengths/Weaknesses of Clustering-Based Approaches

Simple

Many clustering techniques can be used

Can be difficult to decide on a clustering technique

Can be difficult to decide on number of clusters

Outliers can distort the clusters