Similarity and Dissimilarity


last updated 23-Jun-2021

Introduction

We consider similarity and dissimilarity in many places in data science.

Similarity measure: a numerical measure of how alike two data objects are; it is higher when the objects are more alike and often falls in the range [0, 1].

Similarity might be used to identify, for example, near-duplicate records or objects that belong in the same cluster or class.

Dissimilarity measure: a numerical measure of how different two data objects are; it is lower when the objects are more alike. The minimum is often 0, while the upper limit varies.

Dissimilarity might be used to identify, for example, outliers or objects that should fall in different clusters.

Proximity refers to either a similarity or a dissimilarity.


Single attribute sim/dissim measures

Nominal: similarity is binary, 1 if the two values are equal and 0 otherwise (dissimilarity is the complement).

Ordinal: dissimilarity is the difference between the two values (treated as ranks), normalized by the maximum possible difference, n - 1 for n levels.

Quantitative: dissimilarity is just the distance d = |x - y|; similarity attempts to rescale that distance into [0, 1], e.g., s = 1 / (1 + d).
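A minimal Python sketch of these single-attribute measures (the function names and the 1/(1 + d) rescaling are illustrative choices, not the only options):

def nominal_sim(x, y):
    # 1 if the two nominal values match, 0 otherwise
    return 1 if x == y else 0

def ordinal_dissim(x, y, n_levels):
    # values are assumed to be ranks mapped to integers 0 .. n_levels - 1
    return abs(x - y) / (n_levels - 1)

def quantitative_dissim(x, y):
    return abs(x - y)

def quantitative_sim(x, y):
    # one common way to rescale a distance into [0, 1]
    return 1 / (1 + abs(x - y))

print(nominal_sim("red", "blue"))    # 0
print(ordinal_dissim(1, 3, 5))       # 0.5
print(quantitative_sim(3.0, 5.0))    # 0.333...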

 


Distance between instances with multiple attributes.

These distances assume numeric (or at least ordinal) attributes; nominal attributes must fall back on the binary 0-or-1 match/no-match comparison.

Euclidean distance

d(x, y) = sqrt( Σ (xk - yk)^2 )   (sum over k = 1 to n)

where n is the number of dimensions (attributes) and xk and yk are, respectively, the k-th attributes (components) of data objects x and y.

Standardization/normalization may be necessary to ensure an attribute does not skew the distances due to different scales.
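A small NumPy sketch of Euclidean distance with z-score standardization applied first; the toy income/age data is made up purely to show how an attribute on a large scale can dominate:

import numpy as np

def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# toy data: two attributes on very different scales (income, age)
data = np.array([[50000.0, 25.0],
                 [52000.0, 60.0],
                 [51000.0, 30.0]])

# z-score standardization: subtract each column's mean, divide by its std
z = (data - data.mean(axis=0)) / data.std(axis=0)

print(euclidean(data[0], data[1]))  # dominated by the income attribute
print(euclidean(z[0], z[1]))        # both attributes contribute comparably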


Minkowski Distance

is a generalization of Euclidean distance:

d(x, y) = ( Σ |xk - yk|^r )^(1/r)   (sum over k = 1 to n)

where r is a parameter, n is the number of dimensions (attributes) and xk and yk are, respectively, the k-th attributes (components) of data objects x and y.

Examples

r = 1. "City block",  "Manhattan", "taxicab", L1 norm distance.

r = 2. Euclidean distance (L2 norm)

r = ∞. "supremum" (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.

Do not confuse r with n, i.e., all these distance measures are defined for all numbers of dimensions.
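A short sketch of Minkowski distance covering the three cases above (the function name is illustrative):

import numpy as np

def minkowski(x, y, r):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(r):
        return np.max(np.abs(x - y))           # supremum (L_max) distance
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x, y = [0, 2], [3, 6]
print(minkowski(x, y, 1))       # 7.0  (Manhattan / city block)
print(minkowski(x, y, 2))       # 5.0  (Euclidean)
print(minkowski(x, y, np.inf))  # 4.0  (supremum)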


Mahalanobis Distance

takes correlation among the attributes into account. Essentially it measures the distance from the centroid of a cluster whose attributes are significantly correlated:

mahalanobis(x, y) = ( (x - y)' Σ^(-1) (x - y) )^(1/2)

where Σ is the covariance matrix of the data.
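A sketch of this computation, assuming the covariance matrix is estimated from a sample of the data (the synthetic correlated data is only for illustration):

import numpy as np

def mahalanobis(x, y, data):
    # data: the sample used to estimate the covariance matrix (rows = objects)
    cov = np.cov(np.asarray(data, dtype=float), rowvar=False)
    inv_cov = np.linalg.inv(cov)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ inv_cov @ diff))

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# two strongly correlated attributes
data = np.hstack([base, 0.9 * base + rng.normal(scale=0.1, size=(200, 1))])
print(mahalanobis(data[0], data[1], data))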


Common Properties of Distance

Distances, such as the Euclidean distance, have some well known properties.

  1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y.
  2. Symmetry: d(x, y) = d(y, x) for all x and y.
  3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.

A distance that satisfies these properties is a metric.

Similarity Properties

Similarities also have some well-known properties.

  1. s(x, y) = 1 (or maximum similarity) only if x = y (0 ≤ s ≤ 1)
  2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects), x and y.


Similarity Between Binary Vectors

A common situation is that two objects, p and q, have only binary attributes.  Example: A set of products bought or not.

Compute similarities using the following quantities (counts)

f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

Simple Matching and Jaccard Coefficients

The Simple Matching Coefficient (SMC) measures similarity in a symmetric setting, i.e., both 0-0 and 1-1 matches count:

SMC = number of matches / number of attributes = (f11 + f00) / (f01 + f10 + f11 + f00)

Jaccard measures similarity in an asymmetric setting, i.e., only 1-1 matches count, and the 0-0 matches are also left out of the denominator:

J = number of 11 matches / number of non-zero attributes = f11 / (f01 + f10 + f11)

Example: SMC vs. Jaccard

x = 1 0 1 1 0 1 0 0 0 1
y = 1 1 0 1 0 0 0 0 1 1

f01= 2,  f10= 2, f00=3, f11= 3

SMC = (3 + 3)/10 = 6/10 = 0.6 and J = 3/(2 + 2 + 3) = 3/7 ≈ 0.43
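A short sketch that reproduces this example (the function name is illustrative):

def smc_and_jaccard(x, y):
    # x, y: equal-length binary (0/1) vectors
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f11 + f00 + f10 + f01)
    jaccard = f11 / (f11 + f10 + f01) if (f11 + f10 + f01) > 0 else 0.0
    return smc, jaccard

x = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
print(smc_and_jaccard(x, y))   # (0.6, 0.42857...)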


Cosine Similarity

Cosine similarity applies when each instance is described by a set of quantified attributes (e.g., word counts in documents); it is an alternative to Minkowski distances.

If d1 and d2 are two document vectors, then cos( d1, d2 ) = <d1,d2> / ||d1|| ||d2|| ,
where <d1,d2> indicates inner product or vector dot product  d1'd2 of vectors  d1 and d2, and || d || is the length of vector d.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^(1/2) = (42)^(1/2) ≈ 6.481
|| d2 || = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^(1/2) = (6)^(1/2) ≈ 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.3150
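The same computation as a small NumPy sketch:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))   # 0.315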

 

Extended Jaccard Coefficient (Tanimoto)

A variation of Jaccard for continuous or count attributes:

EJ(x, y) = <x, y> / ( ||x||^2 + ||y||^2 - <x, y> )

which reduces to the Jaccard coefficient in the case of binary attributes.
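A minimal sketch of the extended Jaccard coefficient using the formula above, reusing the document vectors d1 and d2 purely for illustration:

import numpy as np

def tanimoto(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    dot = a @ b
    return float(dot / (a @ a + b @ b - dot))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(tanimoto(d1, d2), 4))   # 5 / (42 + 6 - 5) ≈ 0.1163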


Correlation

measures the linear relationship between two data objects: corr(x, y) = covariance(x, y) / ( σ(x) σ(y) ), which always lies in [-1, 1].

Visually evaluating correlation: scatter plots give a quick sense of how strong the linear relationship is.

A counterexample: zero correlation despite a perfect (nonlinear) functional relationship.

x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

In this case, notice that yi = xi^2, so there is clearly a (nonlinear) functional relationship between x and y.

μ(x) = 0, μ(y) = 4
σ(x) = 2.16, σ(y) = 3.74

But the correlation = ( (-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5) ) / ( 6 * 2.16 * 3.74 ) = 0 / 48.5 = 0, where the numerator is Σ (xi - μ(x))(yi - μ(y)) and the 6 is n - 1.
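The counterexample can be verified directly (NumPy's corrcoef; the sample standard deviations use n - 1):

import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

print(x.mean(), y.mean())              # 0.0 4.0
print(x.std(ddof=1), y.std(ddof=1))    # 2.160... 3.741...
print(np.corrcoef(x, y)[0, 1])         # 0.0, even though y is a function of x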


Comparison of Proximity Measures

How to choose among the proximity measures?

Domain of application often drives choice

However, one can talk about various properties that you would like a proximity measure to have

The measure must be applicable to the data and produce results that agree with domain knowledge.

Property                          Cosine   Correlation   Minkowski (Euclidean)
Invariant to scaling (multiply)   Yes      Yes           No
Invariant to translation (add)    No       Yes           No

Example: x = (1, 2, 4, 3, 0, 0, 0, 0) and y = (1, 2, 3, 4, 0, 0, 0, 0)

Comparison                         Cosine   Correlation   Euclidean (r = 2)
x vs. y                            0.9667   0.9429        1.4142
x vs. ys = 2*y (scaled y)          0.9667   0.9429        5.8310
x vs. yt = y + 5 (translated y)    0.7940   0.9429        14.2127
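A short sketch that reproduces the table's values and illustrates the invariance properties (the names ys and yt follow the example above):

import numpy as np

def cos(a, b):    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
def corr(a, b):   return np.corrcoef(a, b)[0, 1]
def euclid(a, b): return np.linalg.norm(a - b)

x  = np.array([1, 2, 4, 3, 0, 0, 0, 0], dtype=float)
y  = np.array([1, 2, 3, 4, 0, 0, 0, 0], dtype=float)
ys = 2 * y    # scaled
yt = y + 5    # translated

for label, v in [("y", y), ("ys = 2*y", ys), ("yt = y + 5", yt)]:
    print(label, round(cos(x, v), 4), round(corr(x, v), 4), round(euclid(x, v), 4))
# cosine changes only under translation, correlation changes under neither,
# Euclidean distance changes under both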

 


Information Based Measures

Information theory is a well-developed and fundamental discipline with broad applications.

 

Information relates to possible outcomes of an event

E.g., transmission of a message, flip of a coin, or measurement of a piece of data.

The more certain an outcome, the less information that it contains and vice-versa

More quantitatively, the information content of an outcome is inversely related to its probability: the less probable the outcome, the more information observing it conveys.

 


Entropy

is the most commonly used measure of information.

For a variable (event), X, with n possible values (outcomes), x1, x2 …, xn,
and each outcome having probability, p1, p2 …, pn,

the entropy of X, H(X), is given by

H(X) = - Σ pi log2 pi   (sum over i = 1 to n)

with 0 ≤ H(X) ≤ log2(n), and it is measured in bits.

Thus, entropy is a measure of how many bits it takes to represent an observation of X on average. Not likely an integer!

Examples

For a coin with probability p for heads and probability q = 1 – p for tails

H = - p log2 p - q log2 q

For a fair coin, p = 0.5, q = 0.5: H = 1
For a completely biased coin, p = 1 or q = 1: H = 0
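A minimal sketch of the entropy computation (the helper name entropy is illustrative; terms with probability 0 are skipped since 0 * log2 0 is taken to be 0):

import numpy as np

def entropy(probs):
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))    # 1.0   (fair coin)
print(entropy([1.0, 0.0]))    # 0.0   (completely biased coin)
print(entropy([0.75, 0.25]))  # ~0.8113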

A more realistic example

Hair Color   Count   p      -p log2 p
Black        75      0.75   0.3113
Brown        15      0.15   0.4105
Blond        5       0.05   0.2161
Red          0       0.00   0
Other        5       0.05   0.2161
Total        100     1.00   1.1540

Maximum entropy is log2(5) = 2.3219; that is, we need more than 2 bits for 5 unique values, but not more than 3.

In general, suppose we have a number of observations (m) of some attribute, X, e.g., the hair color of students in the class,

where there are n different possible values.

And the number of observations in the i-th category is mi, thus the probability of that category is mi / m.

Then, for this sample, the entropy is

H = - Σ (mi / m) log2(mi / m)   (sum over i = 1 to n)
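A short sketch computing the hair-color sample entropy from the counts above:

import math

counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
m = sum(counts.values())

# categories with zero observations contribute nothing to the sum
H = -sum((mi / m) * math.log2(mi / m) for mi in counts.values() if mi > 0)
print(round(H, 4))   # 1.154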

For continuous data, the calculation is harder.


Density

Measures the degree to which data objects are close to each other in a specified area

The notion of density is closely related to that of proximity

Concept of density is typically used for clustering and anomaly detection

Examples:

Euclidean Density: Grid-based Approach

The simplest approach is to divide the region into a number of rectangular cells of equal volume,
then define the density of a cell as the number of points it contains.

Euclidean Density: Center-based Approach

Euclidean density is the number of points within a specified radius of the point.
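A minimal sketch of both density notions (grid-based cell counts and the center-based count within a radius); cell_size, radius, and the random points are purely illustrative:

import numpy as np

def grid_density(points, cell_size):
    # count how many points fall into each equal-sized rectangular cell
    points = np.asarray(points, dtype=float)
    cells = np.floor(points / cell_size).astype(int)
    density = {}
    for cell in map(tuple, cells):
        density[cell] = density.get(cell, 0) + 1
    return density

def center_based_density(points, center, radius):
    # number of points within the given radius of the center point
    points = np.asarray(points, dtype=float)
    return int(np.sum(np.linalg.norm(points - center, axis=1) <= radius))

rng = np.random.default_rng(1)
pts = rng.uniform(0, 10, size=(100, 2))
print(grid_density(pts, cell_size=2.5))            # cell index -> point count
print(center_based_density(pts, pts[0], radius=1.0))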