
last updated 23-Jun-2021

These topics have been referenced in other settings. Much of this may be review, but we can use some of the algorithms covered in the course for these purposes.

- Anomalies (outliers) are the set of data points that are considerably different from the remainder of the data

Natural implication is that anomalies are relatively rare

- One in a thousand occurs often if you have lots of data
- Context is important, e.g., freezing temps in July

Can be important or a nuisance

- 10 foot tall 2 year old
- Unusually high blood pressure

- **Fraud Detection:** odd credit card charges
- **Intrusion Detection:** a quick sequence of authorization failures
- **Ecosystem Disturbances:** floods, droughts, heat waves
- **Medicine and public health:** influenza outbreaks
- **Aviation Safety:** abnormal pilot behavior or aircraft sequence of events

In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels.

Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

The ozone concentrations recorded by the satellite were so low they were being treated as outliers by the software and discarded!

[Sources: http://exploringdata.cqu.edu.au/ozone.html, http://www.epa.gov/ozone/science/hole/size.html]

**Data from different classes:** Measuring the weights of oranges, but a few grapefruit are mixed in

**Natural variation:** Unusually tall people

**Data errors:** 200 pound 2 year old

Noise is erroneous, perhaps random, values or contaminating objects

- Weight recorded incorrectly
- Grapefruit mixed in with the oranges

Noise doesn’t necessarily produce unusual values or objects. Such noise may be harder to detect, but it is noise nonetheless.

Noise is not interesting, unless it can be used to rate the quality/accuracy of the instrument generating the data.

Anomalies may be interesting, provided they are not a result of noise. While noise and anomalies are related, they are distinct concepts.

Many anomalies are defined in terms of a single attribute, which is easy to visualize and automate

- Height
- Shape
- Color

However, an object may not be anomalous in any one attribute

It can be hard to find an anomaly using all attributes

- 200 pound 2 year old as an example: 200 pounds by itself is reasonable as is 2 years old
- Noisy or irrelevant attributes
- Object is only anomalous with respect to a subset of attributes

Pairings of attributes

- as in a scatterplot matrix
- other combinations may be scrutinized

Many anomaly detection techniques provide only a binary categorization

- An object is an anomaly or it isn’t
- This is especially true of classification-based approaches

Other approaches assign a score to all points

- This score measures the degree to which an object is an anomaly
- This allows objects to be ranked

In the end, you often still need a binary decision

- Should this credit card transaction be flagged?
- Still useful to have a score (its quality can be assessed if the decision turns out to be wrong)

How many anomalies are there?

Find all anomalies at once or one at a time

- Swamping: non-outliers identified as outliers
- Masking: an outlier not identified as an outlier

Evaluation

- How do you measure performance? Quality
- Supervised vs. unsupervised situations

Efficiency

- cost of performance
- complexity

Context: what do you as a data scientist bring to the table?

Given a data set D, find all data points **x ∈ D** with anomaly scores greater than some threshold **t**

Given a data set D, find all data points **x ∈ D** having the top-**n** largest anomaly scores

Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
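The three problem statements above can be sketched in a few lines of Python. This is only an illustration: the function names and the scoring rule (distance from the median of D) are assumptions made for the example, not something prescribed by these notes.

```python
import statistics

def anomalies_above_threshold(scores, t):
    """Variant 1: indices of all points whose anomaly score exceeds t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n_anomalies(scores, n):
    """Variant 2: indices of the n largest anomaly scores, highest first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

def score_test_point(x, data):
    """Variant 3: score a test point x against the (mostly normal) data D.

    Assumed scoring rule, for illustration only: distance from the median.
    """
    return abs(x - statistics.median(data))

D = [10, 11, 9, 10, 12, 10, 30]
scores = [score_test_point(x, D) for x in D]

print(anomalies_above_threshold(scores, t=5))  # points scoring above the threshold
print(top_n_anomalies(scores, n=2))            # the two highest-scoring points
print(score_test_point(25, D))                 # score of a new test point
```

Any of the scoring approaches discussed later (statistical, proximity-based, density-based) could be substituted for the median-distance rule without changing the three outer functions.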

Build a model for the data and review the effects

**Unsupervised**

- Anomalies are those points that don’t fit well
- Anomalies are those points that distort the model
- Examples:
- Statistical distribution
- Clusters
- Regression
- Geometric
- Graph

**Supervised**

- Anomalies are regarded as a rare class
- Need to have training data

**Proximity-based**

- Anomalies are points far away from other points
- Can detect this graphically in some cases

**Density-based**

- Low density points are outliers

**Pattern matching**

- Create profiles or templates of atypical but important events or objects
- Algorithms to detect these patterns are usually simple and efficient

Boxplots or scatterplots for one attribute or pairs of quantitative attributes.

Limitations of the visual approach are that it is subjective and not automated.

In a boxplot, outliers are points beyond the whiskers.

In a scatterplot, outliers are points not fitting into the visual pattern.
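The boxplot rule can be automated, which addresses part of the limitation noted above. The sketch below assumes the common convention that whiskers extend 1.5 times the interquartile range beyond the quartiles; the notes do not state which whisker rule is intended, and `boxplot_outliers` is a hypothetical helper name.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond the boxplot whiskers.

    Assumes whiskers at Q1 - whisker*IQR and Q3 + whisker*IQR.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

data = [2, 3, 3, 4, 4, 5, 5, 6, 20]
print(boxplot_outliers(data))
```

Different quartile conventions give slightly different fences, so results near the whiskers may vary between tools.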

**Probabilistic definition of an outlier:** An outlier is an object that has a low probability with respect to a probability distribution model of the data.

Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)

- Data distribution
- Parameters of distribution (e.g., mean, variance)
- Number of expected outliers (confidence limit, based on p-value such as 0.05 as below)
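As a minimal sketch of the probabilistic definition, assuming a normal model fitted to the data itself and a two-sided tail probability compared against 0.05: the function name and sample values below are made up for illustration.

```python
from statistics import NormalDist

def low_probability_points(values, alpha=0.05):
    """Flag values whose two-sided tail probability under a fitted
    normal model is below alpha (assumed rule for this sketch)."""
    model = NormalDist.from_samples(values)  # estimates mean and std. dev.
    flagged = []
    for v in values:
        # Probability of a value at least this far from the mean, either side.
        upper = model.cdf(model.mean + abs(v - model.mean))
        p_value = 2 * (1 - upper)
        if p_value < alpha:
            flagged.append(v)
    return flagged

temps = [98.1, 98.4, 98.6, 98.7, 98.6, 98.5, 98.8, 98.3, 98.6, 102.5]
print(low_probability_points(temps))
```

Note that the extreme value is included when estimating the mean and standard deviation, so it inflates the spread it is judged against.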

- Identifying the distribution of a data set
- Do you have a heavy tailed distribution?
- Number of attributes
- Is the data a mixture of distributions?

Grubbs’ test detects outliers in univariate data (the algorithm is detailed in Exercise 9.7 at the end of the chapter)

Assumes the data comes from a normal distribution

Detects one outlier at a time using z-scores, removes the outlier, and repeats

$H_0$: There is no outlier in the data

$H_A$: There is at least one outlier

Grubbs’ test statistic is the largest absolute z-score, where $\bar{X}$ is the mean and $s$ is the standard deviation:

$$G = \frac{\max_i |X_i - \bar{X}|}{s}$$

Reject $H_0$ if $G$ exceeds a critical value. This threshold follows from the assumed normal distribution and the chosen significance level $\alpha$.

Repeat until no more outliers are detected.
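For reference, the rejection rule is usually written in the following two-sided form in standard statistics references. It is not recovered from these notes, so treat the exact variant as an assumption. Here $G$ is the largest absolute z-score, $N$ is the number of data points, and $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the $t$ distribution with $N-2$ degrees of freedom at significance level $\alpha/(2N)$:

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

A one-sided version of the test uses $\alpha/N$ in place of $\alpha/(2N)$.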

Need to recalculate mean and standard deviation after removing the outlier.
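The detect, remove, and recompute loop can be sketched as follows. This is not a full Grubbs' test: the critical value is supplied by the caller through a hypothetical `critical_value(n)` function, and the fixed cutoff of 2.5 in the usage lines is an arbitrary stand-in for the value a real test would derive from the significance level and sample size.

```python
import statistics

def iterative_zscore_removal(values, critical_value):
    """Repeatedly remove the point with the largest |z-score| while that
    z-score exceeds critical_value(n), recomputing mean and s each round."""
    remaining = list(values)
    outliers = []
    while len(remaining) >= 3:
        mean = statistics.mean(remaining)
        s = statistics.stdev(remaining)      # sample standard deviation
        if s == 0:
            break                            # all values identical
        candidate = max(remaining, key=lambda v: abs(v - mean))
        g = abs(candidate - mean) / s        # largest absolute z-score
        if g <= critical_value(len(remaining)):
            break                            # no outlier detected this round
        outliers.append(candidate)
        remaining.remove(candidate)
    return outliers, remaining

data = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.1, 4.9, 9.0]
# Hypothetical fixed cutoff, for illustration only.
outliers, cleaned = iterative_zscore_removal(data, critical_value=lambda n: 2.5)
print(outliers, len(cleaned))
```

Recomputing the statistics inside the loop matters: with 9.0 present the standard deviation is about 1.16, and after removing it the spread of the remaining points is roughly ten times smaller.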

Firm mathematical foundation

Can be very efficient

Good results if distribution is known, but in many cases, data distribution may not be known

For high dimensional data, it may be difficult to estimate the true distribution

Anomalies can distort the parameters of the distribution

Several different techniques.

An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)
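A brute-force sketch of this definition, under the assumptions that distance is Euclidean and that the fraction is taken over the other objects in the data set. The parameter names `d` and `fraction` are chosen here for illustration.

```python
import math

def distance_based_outliers(points, d, fraction):
    """Return points for which at least `fraction` of the other points
    lie more than distance d away."""
    outliers = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if math.dist(p, q) > d)
        if far / len(others) >= fraction:
            outliers.append(p)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
print(distance_based_outliers(points, d=3, fraction=0.9))
```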

Some statistical definitions are special cases of this.

The outlier score of an object is the distance to its k-th nearest neighbor.
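A minimal sketch of this score, assuming Euclidean distance and a brute-force search over all pairs (which is where the quadratic cost comes from). The function name is hypothetical.

```python
import math

def kth_neighbor_distance(points, k):
    """Outlier score of each point: distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = kth_neighbor_distance(points, k=2)
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)
```

The choice of k matters: with k = 1, a pair of nearby outliers would each receive a low score.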

Simple

Expensive: $O(n^2)$

Sensitive to parameters

Sensitive to variations in density

Distance becomes less meaningful in high-dimensional space

**Density-based Outlier:** The outlier score of an object is the inverse of the density around the object.

- Can be defined in terms of the k nearest neighbors
- One definition: Inverse of distance to the k-th neighbor
- Another definition: Inverse of the average distance to k neighbors
- DBSCAN definition
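The second definition (inverse of the average distance to the k nearest neighbors) can be sketched as below, assuming Euclidean distance and no duplicate points, since a zero average distance would divide by zero. The outlier score is then the inverse of this density.

```python
import math

def knn_density(points, k):
    """Density of each point: inverse of the mean distance to its
    k nearest neighbors. Assumes no duplicate points."""
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        avg = sum(dists[:k]) / k
        densities.append(1.0 / avg)
    return densities

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
densities = knn_density(points, k=2)
lowest = points[densities.index(min(densities))]
print(lowest)  # the lowest-density point is the strongest outlier candidate
```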

If there are regions of different density, this approach can have problems

Consider the density of a point relative to that of its k nearest neighbors

For each point, compute the density of its local neighborhood

Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of each of its nearest neighbors to the density of p

Outliers are the points with the largest LOF values
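A simplified relative-density sketch of this idea follows. It uses the inverse average k-nearest-neighbor distance as the density and omits the reachability distance of the original LOF formulation, so it is an approximation for illustration rather than the published algorithm. The data set is invented: a tight cluster, a loose cluster, and one point near the tight cluster.

```python
import math

def simplified_lof(points, k):
    """For each point, average the ratio of its neighbors' densities to
    its own density. Density = inverse mean distance to k neighbors."""
    neighbors, density = [], []
    for i, p in enumerate(points):
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        neighbors.append([j for _, j in ranked])
        density.append(k / sum(dist for dist, _ in ranked))
    return [
        sum(density[j] for j in neighbors[i]) / (k * density[i])
        for i in range(len(points))
    ]

dense = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1)]
sparse = [(10, 10), (12, 10), (10, 12), (12, 12)]
points = dense + sparse + [(1, 1)]   # last point sits just outside the dense cluster
lof = simplified_lof(points, k=2)
print([round(v, 2) for v in lof])
```

Members of both clusters score about 1 because their density matches that of their neighbors, while the final point scores far higher. Its absolute density is actually greater than that of the loose cluster's points, which is why a single global density cutoff struggles with regions of different density.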

In the Nearest Neighbor approach, *p _{2}*

Simple

Expensive: $O(n^2)$

Sensitive to parameters

Density becomes less meaningful in high-dimensional space

**Clustering-based Outlier:** An object is a cluster-based outlier if it does not strongly belong to any cluster

- For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
- For density-based clusters, an object is an outlier if its density is too low
- For graph-based clusters, an object is an outlier if it is not well connected
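For the prototype-based case, a minimal sketch might look like the following. It assumes the cluster centers have already been found (for example by k-means) and that "close enough" means a Euclidean distance below a chosen cutoff. The centers, the cutoff `max_distance`, and the function name are all hypothetical.

```python
import math

def prototype_outliers(points, centers, max_distance):
    """Return points whose distance to the nearest cluster center
    exceeds max_distance."""
    return [
        p for p in points
        if min(math.dist(p, c) for c in centers) > max_distance
    ]

centers = [(0, 0), (10, 10)]   # assumed output of a prior clustering step
points = [(0.5, 0), (0, -0.5), (10, 10.5), (9.5, 10), (5, 5)]
print(prototype_outliers(points, centers, max_distance=2))
```

Because the centers are computed from data that may include outliers, the centers themselves can be pulled toward the points this check is meant to find.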

Other issues include the impact of outliers on the clusters and the number of clusters

Simple

Many clustering techniques can be used

Can be difficult to decide on a clustering technique

Can be difficult to decide on number of clusters

Outliers can distort the clusters