DM 352 Syllabus | DM 552 Syllabus

last updated 23-Jun-2021

We consider similarity and dissimilarity in many places in data science.

**Similarity** measure

- is a numerical measure of how alike two data objects are.
- higher when objects are more alike.
- often falls in the range [0,1]

Similarity might be used to identify

- duplicate data that may have differences due to typos.
- equivalent instances from different data sets. E.g. names and/or addresses that are the same but have misspellings.
- groups of data that are very close (clusters)

**Dissimilarity** measure

- is a numerical measure of how different two data objects are
- lower when objects are more alike
- minimum dissimilarity is often 0, while the upper limit varies depending on how much the data can vary

Dissimilarity might be used to identify

- outliers
- interesting exceptions, e.g. credit card fraud
- boundaries to clusters

**Proximity** refers to either a similarity or dissimilarity

**Nominal** proximity is binary: two values either are equal or they are not

**Ordinal** proximity is the difference between two values, normalized by the maximum possible distance

**Quantitative** dissimilarity is simply a distance between values; similarity attempts to scale that distance to [0,1]

Quantitative and ordinal attributes are naturally numeric, but nominal attributes must resort to a binary 0 or 1 depending on whether the values match

Euclidean distance

d(**x**, **y**) = ( Σ_{k=1...n} (x_{k} − y_{k})^{2} )^{1/2}

where *n* is the number of dimensions (attributes) and x_{k} and y_{k} are, respectively, the *k*-th attributes (components) of data objects **x** and **y**

Standardization/normalization may be necessary to ensure an attribute does not skew the distances due to different scales.
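A minimal sketch of both steps in Python; the helper names `euclidean` and `min_max_normalize` are illustrative, not from any particular library:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def min_max_normalize(column):
    """Rescale one attribute's values to [0, 1] so its scale
    does not dominate the distance computation."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

print(euclidean([0, 2], [2, 0]))        # ≈ 2.828
print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```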

**Minkowski distance** is a generalization of Euclidean distance

d(**x**, **y**) = ( Σ_{k=1...n} |x_{k} − y_{k}|^{r} )^{1/r}

where *r* is a parameter, *n* is the number of dimensions (attributes), and x_{k} and y_{k} are, respectively, the *k*-th attributes (components) of data objects **x** and **y**.

**r = 1.** "City block", "Manhattan", "taxicab", L_{1} norm distance.

- Another example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

**r = 2.** Euclidean distance (L_{2} norm)

**r = ∞.** “supremum” (L_{max} norm, L_{∞} norm) distance.
This is the maximum difference between any component of the vectors

Do not confuse r with n, i.e., all these distance measures are defined for all numbers of dimensions.
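The three special cases above can be sketched with one hypothetical helper, where `r = math.inf` selects the supremum distance:

```python
import math

def minkowski(x, y, r):
    """Minkowski distance: r=1 Manhattan, r=2 Euclidean, r=math.inf supremum."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if math.isinf(r):
        return max(diffs)            # L_inf: largest per-component difference
    return sum(d ** r for d in diffs) ** (1 / r)

x, y = [0, 2, 3], [4, 0, 3]
print(minkowski(x, y, 1))            # 6    (Manhattan)
print(minkowski(x, y, 2))            # ≈ 4.472 (Euclidean)
print(minkowski(x, y, math.inf))     # 4    (supremum)
```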

The **Mahalanobis distance** essentially measures distance from the centroid of the cluster while accounting for significant correlation among the attributes.

Distances, such as the Euclidean distance, have some well known properties.

- **Positivity:** d(x, y) ≥ 0 for all **x** and **y**, and d(x, y) = 0 only if **x** = **y**.
- **Symmetry:** d(x, y) = d(y, x) for all **x** and **y**.
- **Triangle Inequality:** d(x, z) ≤ d(x, y) + d(y, z) for all points **x**, **y**, and **z**.

where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.

A distance that satisfies these properties is a **metric**.

Similarities also have some well-known properties.

- s(x, y) = 1 (or maximum similarity) only if x = y (0 ≤ s ≤ 1)
- s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects), x and y.

A common situation is that two objects, p and q, have only binary attributes. Example: A set of products bought or not.

Compute similarities using the following quantities (counts)

f_{01} = the number of attributes where p was 0 and q was 1

f_{10} = the number of attributes where p was 1 and q was 0

f_{00} = the number of attributes where p was 0 and q was 0

f_{11} = the number of attributes where p was 1 and q was 1

The Simple Matching Coefficient (SMC) measures similarity in a symmetric setting, i.e., both 0 and 1 matches count.

- SMC = number of matches / number of attributes = (f_{11} + f_{00}) / (f_{01} + f_{10} + f_{11} + f_{00})

Jaccard measures an asymmetric setting, i.e., it only considers matches of 1's, ignoring the 0's in the denominator as well.

- J = number of 1-1 matches / number of non-zero attributes = f_{11} / (f_{01} + f_{10} + f_{11})

x = 1 0 1 1 0 1 0 0 0 1

y = 1 1 0 1 0 0 0 0 1 1

f_{01} = 2, f_{10} = 2, f_{00} = 3, f_{11} = 3

SMC = (3 + 3) / 10 = 6/10 and J = 3 / (2 + 2 + 3) = 3/7
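The counts and both coefficients are easy to verify in code (helper names are illustrative); for the vectors above, f_{11} + f_{00} = 6, so SMC = 0.6:

```python
def binary_counts(p, q):
    """Return (f01, f10, f00, f11) for two 0/1 vectors."""
    f01 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 1))
    f10 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 0))
    f00 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 0))
    f11 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 1))
    return f01, f10, f00, f11

def smc(p, q):
    f01, f10, f00, f11 = binary_counts(p, q)
    return (f11 + f00) / (f01 + f10 + f00 + f11)

def jaccard(p, q):
    f01, f10, f00, f11 = binary_counts(p, q)
    return f11 / (f01 + f10 + f11)   # 0-0 matches excluded

x = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
print(smc(x, y))      # 0.6
print(jaccard(x, y))  # ≈ 0.4286 (3/7)
```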

When you have a set of quantified attributes for each instance, **cosine similarity** offers an alternative to Minkowski distances.

If **d_{1}** and **d_{2}** are two vectors, then

cos(**d_{1}**, **d_{2}**) = <**d_{1}**, **d_{2}**> / ( ||**d_{1}**|| ||**d_{2}**|| )

where <**d_{1}**, **d_{2}**> is the dot product and ||**d**|| is the length (L_{2} norm) of vector **d**.

Example:

d_{1} = 3 2 0 5 0 0 0 2 0 0

d_{2} = 1 0 0 0 0 0 0 1 0 2

<**d_{1}**, **d_{2}**> = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5

||**d_{1}**|| = (9 + 4 + 0 + 25 + 0 + 0 + 0 + 4 + 0 + 0)^{1/2} ≈ 6.481

||**d_{2}**|| = (1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 4)^{1/2} ≈ 2.449

cos(**d_{1}**, **d_{2}**) = 5 / (6.481 × 2.449) ≈ 0.315
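A short sketch of the cosine computation (the function name is illustrative):

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # ≈ 0.315
```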

The **extended Jaccard (Tanimoto) coefficient** is a variation of Jaccard for continuous or count attributes

- Reduces to Jaccard for binary attributes
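A sketch, assuming the standard form of the extended Jaccard coefficient, EJ(x, y) = x·y / (||x||² + ||y||² − x·y); the function name is illustrative:

```python
def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) for count or continuous vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return dot / (xx + yy - dot)

# For 0/1 vectors this reduces to the plain Jaccard coefficient:
p = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
q = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
print(extended_jaccard(p, q))  # ≈ 0.4286 (3/7, matching Jaccard)
```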

**Correlation** measures the linear relationship between objects. Example:

x = (-3, -2, -1, 0, 1, 2, 3)

y = (9, 4, 1, 0, 1, 4, 9)

In this case, notice that y_{i} = x_{i}^{2}, so there is clearly a functional relationship between x and y.

μ(x) = 0, μ(y) = 4

σ(x) = 2.16, σ(y) = 3.74

But the correlation = [(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)] / (6 × 2.16 × 3.74) = 0
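The zero result can be checked with a small sample Pearson correlation sketch (function name illustrative):

```python
import math

def pearson(x, y):
    """Sample Pearson correlation: covariance over the product of
    sample standard deviations (n - 1 in each denominator)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [9, 4, 1, 0, 1, 4, 9]
print(pearson(x, y))  # 0.0: perfect nonlinear dependence, zero linear correlation
```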

How to choose among the proximity measures?

Domain of application often drives choice

- Similarity measures tend to depend on the type of attribute and data
- Record data, images, graphs, sequences, 3D-protein structure, etc. will use different measures

However, one can talk about various properties that you would like a proximity measure to have

- Symmetry is a common one
- Tolerance to noise and outliers is another
- Ability to find more types of patterns?
- Many others possible

The measure must be applicable to the data and produce results that agree with domain knowledge.

| Property | Cosine | Correlation | Minkowski (r=2) |
|---|---|---|---|
| Invariant to scaling (multiply) | Yes | Yes | No |
| Invariant to translation (add) | Yes | Yes | No |
| x = (1,2,4,3,0,0,0) and y = (1,2,3,4,0,0,0) | 0.9667 | 0.9429 | 1.414 |
| same x and s = (2,4,6,8,0,0,0), y scaled by 2 | 0.9667 | 0.9429 | 5.831 |
| same x and t = (6,7,8,9,5,5,5), y translated by 5 | 0.794 | 0.9429 | 14.2127 |

**Information theory** is a well-developed and fundamental discipline with broad applications

Information relates to possible outcomes of an event

E.g., transmission of a message, flip of a coin, or measurement of a piece of data

The more certain an outcome, the less information that it contains and vice-versa

- For example, if a coin has two heads, then an outcome of heads provides no information

More quantitatively, the information is inversely related to the probability of an outcome

- The smaller the probability of an outcome, the more information it provides and vice-versa

**Entropy** is the commonly used measure for information

For a variable (event), X, with n possible values (outcomes) x_{1}, x_{2}, …, x_{n}, each outcome having probability p_{1}, p_{2}, …, p_{n}, the entropy of X, H(X), is given by

H(X) = − Σ_{i=1...n} p_{i} log_{2} p_{i}

and is measured in bits, where 0 ≤ H(X) ≤ log_{2} n.

Thus, entropy is a measure of how many bits it takes to represent an observation of X on average. Not likely an integer!

For a coin with probability p for heads and probability q = 1 − p for tails:

H = −p log_{2} p − q log_{2} q

For a fair coin, p = 0.5, q = 0.5:

H = 1

For a weighted coin with p = 1 or q = 1, H = 0.

A more realistic example:
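A sketch of the two-outcome entropy calculation (function name illustrative):

```python
import math

def coin_entropy(p):
    """Entropy in bits of a coin with P(heads) = p."""
    h = 0.0
    for prob in (p, 1 - p):
        if prob > 0:                 # 0 * log2(0) is taken as 0
            h -= prob * math.log2(prob)
    return h

print(coin_entropy(0.5))  # 1.0 (fair coin: one full bit per flip)
print(coin_entropy(1.0))  # 0.0 (two-headed coin: no information)
```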

| Hair Color | Count | p | −p log_{2} p |
|---|---|---|---|
| Black | 75 | 0.75 | 0.3113 |
| Brown | 15 | 0.15 | 0.4105 |
| Blond | 5 | 0.05 | 0.2161 |
| Red | 0 | 0.00 | 0 |
| Other | 5 | 0.05 | 0.2161 |
| Total | 100 | 1.0 | 1.1540 |

Maximum entropy is log_{2} 5 = 2.3219; that is, we need more than 2 bits for 5 unique values, but not more than 3.

In general, suppose we have a number of observations (m) of some attribute, X, e.g., the hair color of students in the class, where there are n different possible values. The number of observations in the i^{th} category is m_{i}, so its probability is m_{i} / m. Then, for this sample, the entropy is

H = − Σ_{i=1...n} (m_{i} / m) log_{2} (m_{i} / m)
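A sketch that computes the sample entropy directly from category counts, reproducing the hair-color total above (function name illustrative):

```python
import math

def entropy_from_counts(counts):
    """Sample entropy in bits from raw category counts m_i."""
    m = sum(counts)
    # zero counts contribute nothing (0 * log2(0) is taken as 0)
    return -sum((mi / m) * math.log2(mi / m) for mi in counts if mi > 0)

hair = [75, 15, 5, 0, 5]  # Black, Brown, Blond, Red, Other
print(round(entropy_from_counts(hair), 4))  # ≈ 1.1540
```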

For continuous data, the calculation is harder.

**Density** measures the degree to which data objects are close to each other in a specified area

The notion of density is closely related to that of proximity

Concept of density is typically used for clustering and anomaly detection

Examples:

- **Euclidean density**: number of points per unit volume
- **Probability density**: estimate what the distribution of the data looks like
- **Graph-based density**: connectivity

The simplest approach is to divide the region into a number of rectangular cells of equal volume, then define density as the number of points each cell contains.
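A minimal sketch of this cell-counting approach in two dimensions (names illustrative):

```python
from collections import Counter

def grid_density(points, cell_size):
    """Count points per square cell of side cell_size (grid-based density)."""
    counts = Counter()
    for x, y in points:
        # integer cell coordinates identify which cell the point falls in
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

pts = [(0.1, 0.2), (0.4, 0.9), (1.5, 1.5), (1.6, 1.8)]
d = grid_density(pts, 1.0)
print(d[(0, 0)], d[(1, 1)])  # 2 2: two points in each occupied cell
```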

Alternatively, Euclidean density can be defined as the number of points within a specified radius of a given point (center-based density)