# Similarity and Dissimilarity

last updated 23-Jun-2021

## Introduction

We consider similarity and dissimilarity in many places in data science.

Similarity measure

• is a numerical measure of how alike two data objects are.
• higher when objects are more alike.
• often falls in the range [0,1]

Similarity might be used to identify

• duplicate data that may have differences due to typos.
• equivalent instances from different data sets. E.g. names and/or addresses that are the same but have misspellings.
• groups of data that are very close (clusters)

Dissimilarity measure

• is a numerical measure of how different two data objects are
• lower when objects are more alike
• minimum dissimilarity is often 0, while the upper limit varies depending on how much variation the attribute allows

Dissimilarity might be used to identify

• outliers
• interesting exceptions, e.g. credit card fraud
• boundaries to clusters

Proximity refers to either a similarity or a dissimilarity.

## Single attribute sim/dissim measures

Nominal: similarity is all-or-nothing, 1 if the two values are equal and 0 if they are not.

Ordinal: dissimilarity is the difference between the two values' positions in the ordering, normalized by the maximum possible distance: d = |x − y| / (n − 1) for n ordered values, with similarity s = 1 − d.

Quantitative: dissimilarity is just the distance between the values, d = |x − y|; similarity attempts to rescale that distance into [0,1], e.g., s = 1 / (1 + d).
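A minimal sketch of these three cases in Python (the function names and the s = 1/(1 + d) rescaling are illustrative choices, not the only conventions):

```python
def nominal_sim(a, b):
    """Nominal attribute: similarity is 1 on an exact match, 0 otherwise."""
    return 1.0 if a == b else 0.0

def ordinal_dissim(rank_a, rank_b, n_levels):
    """Ordinal attribute: rank difference normalized by the maximum distance."""
    return abs(rank_a - rank_b) / (n_levels - 1)

def quantitative_sim(a, b):
    """Quantitative attribute: distance rescaled into (0, 1] via s = 1/(1 + d)."""
    return 1.0 / (1.0 + abs(a - b))

print(nominal_sim("red", "blue"))    # 0.0
print(ordinal_dissim(1, 3, 5))       # 0.5 (ranks 1 and 3 out of 5 levels)
print(quantitative_sim(2.0, 5.0))    # 0.25
```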

## Distance between instances with multiple attributes

Attributes that are numeric or ordinal fit naturally into distance computations, but nominal attributes must fall back on the binary 0-or-1 match-or-not comparison.

Euclidean distance

d(x, y) = ( Σk=1..n (xk − yk)² )^(1/2)

where n is the number of dimensions (attributes) and xk and yk are, respectively, the k-th attributes (components) of data objects x and y.

Standardization/normalization may be necessary to ensure an attribute does not skew the distances due to different scales.
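A sketch of the distance plus the z-score standardization step mentioned above (pure Python, no libraries assumed):

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt of the sum of squared component differences."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def zscore(column):
    """Standardize one attribute so its scale cannot dominate the distance."""
    m = sum(column) / len(column)
    sd = math.sqrt(sum((v - m) ** 2 for v in column) / len(column))
    return [(v - m) / sd for v in column]

print(euclidean([0, 2], [2, 0]))   # 2.8284... = sqrt(8)
print(zscore([10, 20, 30]))        # [-1.2247..., 0.0, 1.2247...]
```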

## Minkowski Distance

is a generalization of Euclidean distance:

d(x, y) = ( Σk=1..n |xk − yk|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes) and xk and yk are, respectively, the k-th attributes (components) of data objects x and y.

### Examples

r = 1. "City block",  "Manhattan", "taxicab", L1 norm distance.

• Another example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

r = 2. Euclidean distance (L2 norm)

r = ∞. “supremum” (Lmax or L∞ norm) distance. This is the maximum difference between any components of the two vectors: d(x, y) = maxk |xk − yk|

Do not confuse r with n, i.e., all these distance measures are defined for all numbers of dimensions.
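A sketch covering the three special cases (r = ∞ is the limit of the formula, computed directly as a max):

```python
import math

def minkowski(x, y, r):
    """Minkowski distance: r=1 Manhattan, r=2 Euclidean, r=inf supremum."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if math.isinf(r):
        return float(max(diffs))          # L-max / supremum norm
    return sum(d ** r for d in diffs) ** (1.0 / r)

x, y = [0, 0], [3, 4]
print(minkowski(x, y, 1))          # 7.0  (city block)
print(minkowski(x, y, 2))          # 5.0  (Euclidean)
print(minkowski(x, y, math.inf))   # 4.0  (supremum)
```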

## Mahalanobis Distance

Mahalanobis distance takes the spread and correlation of the data into account. Essentially it measures the distance of a point from the centroid of the cluster in units of the data's variation along each direction, which matters when there is significant correlation among the attributes:

mahalanobis(x, y) = ( (x − y)ᵀ S⁻¹ (x − y) )^(1/2)

where S is the covariance matrix of the data.
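A sketch with NumPy, estimating the covariance matrix from a small made-up sample (in practice S would come from the cluster or the full data set):

```python
import numpy as np

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T S^{-1} (x - y)) for covariance matrix S."""
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Toy sample with two strongly correlated attributes.
data = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2],
                 [4.0, 7.8], [5.0, 10.1]])
cov = np.cov(data, rowvar=False)     # columns are the attributes

centroid = data.mean(axis=0)
print(mahalanobis(data[0], centroid, cov))
```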

## Common Properties of Distance

Distances, such as the Euclidean distance, have some well known properties.

1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.

A distance that satisfies these properties is a metric.

## Similarity Properties

Similarities also have some well-known properties.

1. s(x, y) = 1 only if x = y (assuming similarity is scaled so that 0 ≤ s ≤ 1, with 1 the maximum)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects), x and y.

## Similarity Between Binary Vectors

A common situation is that two objects, p and q, have only binary attributes.  Example: A set of products bought or not.

Compute similarities using the following quantities (counts)

f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

### Simple Matching and Jaccard Coefficients

The simple matching coefficient (SMC) measures similarity in a symmetric setting, i.e., 0-0 and 1-1 matches count equally.

• SMC = number of matches / number of attributes = (f11 + f00) / (f01 + f10 + f11 + f00)

Jaccard measures the asymmetric setting, i.e., only 1-1 matches count, and the 0-0 matches are dropped from the denominator as well.

• J = number of 1-1 matches / number of non-zero attributes = f11 / (f01 + f10 + f11)

### Example

x = 1 0 1 1 0 1 0 0 0 1
y = 1 1 0 1 0 0 0 0 1 1

f01= 2,  f10= 2, f00=3, f11= 3

SMC = (3 + 3)/10 = 0.6 and J = 3/(2 + 2 + 3) = 3/7 ≈ 0.43
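A quick Python check of this example:

```python
def smc_and_jaccard(p, q):
    """Count the four match types over two binary vectors, then compute SMC and J."""
    f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    j = f11 / (f01 + f10 + f11)
    return smc, j

x = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
print(smc_and_jaccard(x, y))   # (0.6, 0.42857...)
```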

## Cosine Similarity

Cosine similarity applies when each instance is described by a vector of quantitative attributes (e.g., term counts for documents); it is an alternative to the Minkowski distances.

If d1 and d2 are two document vectors, then

cos(d1, d2) = <d1, d2> / ( ||d1|| ||d2|| )

where <d1, d2> indicates the inner product (vector dot product) d1'd2 of vectors d1 and d2, and ||d|| is the length of vector d.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^(1/2) = (42)^(1/2) ≈ 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^(1/2) = (6)^(1/2) ≈ 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.3150
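The same computation in Python:

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(cosine_similarity(d1, d2))   # 0.3150...
```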

### Extended Jaccard Coefficient (Tanimoto)

Variation of Jaccard for continuous or count attributes

• Reduces to Jaccard for binary attributes
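The notes omit the formula; the standard definition is EJ(x, y) = x·y / (||x||² + ||y||² − x·y). A sketch:

```python
def tanimoto(x, y):
    """Extended Jaccard: dot / (||x||^2 + ||y||^2 - dot)."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return dot / (sq_x + sq_y - dot)

# For binary vectors this reduces to the ordinary Jaccard coefficient:
print(tanimoto([1, 0, 1, 1], [1, 1, 0, 1]))   # 2 / (3 + 3 - 2) = 0.5
```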

## Correlation

measures the linear relationship between two data objects (or two attributes):

corr(x, y) = covariance(x, y) / ( std(x) * std(y) )

It always lies in the range [−1, 1], and a value of 0 means only that there is no linear relationship.

### Counterexample: a functional relationship with zero correlation

x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

In this case, notice that yi = xi², so there is clearly a functional relationship between x and y, but it is not linear.

μ(x) = 0, μ(y) = 4
σ(x) = 2.16, σ(y) = 3.74

But the correlation = [ (−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5) ] / ( 6 * 2.16 * 3.74 ) = 0 / 48.5 = 0
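A quick NumPy check:

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2                        # perfect, but nonlinear, relationship

print(np.corrcoef(x, y)[0, 1])    # 0.0 -- correlation only detects linear structure
```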

## Comparison of Proximity Measures

How to choose among the proximity measures?

Domain of application often drives choice

• Similarity measures tend to depend on the type of attribute and data
• Record data, images, graphs, sequences, 3D-protein structure, etc. will use different measures

However, one can talk about various properties that you would like a proximity measure to have

• Symmetry is a common one
• Tolerance to noise and outliers is another
• Ability to find more types of patterns?
• Many others possible

The measure must be applicable to the data and produce results that agree with domain knowledge.

| Property | Cosine | Correlation | Minkowski |
|---|---|---|---|
| Invariant to scaling (multiply) | Yes | Yes | No |
| Invariant to translation (add) | No | Yes | No |

For example, with x = (1, 2, 4, 3, 0, 0, 0) and y = (1, 2, 3, 4, 0, 0, 0):

| Comparison | Cosine | Correlation | Minkowski (r = 2) |
|---|---|---|---|
| x vs. y | 0.9667 | 0.9429 | 1.4142 |
| x vs. s = 2y = (2, 4, 6, 8, 0, 0, 0) (y scaled) | 0.9667 | 0.9429 | 5.831 |
| x vs. t = y + 5 = (6, 7, 8, 9, 5, 5, 5) (y translated) | 0.794 | 0.9429 | 14.2127 |
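A sketch demonstrating these invariances on a small made-up pair of vectors (the point is which columns change, not the particular values):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([2.0, 0.0, 1.0, 3.0])
y = np.array([1.0, 1.0, 0.0, 2.0])

for label, v in [("y", y), ("2y (scaled)", 2 * y), ("y+5 (translated)", y + 5)]:
    print(f"{label:18s}"
          f" cosine={cosine(x, v):.4f}"            # unchanged by scaling only
          f" corr={np.corrcoef(x, v)[0, 1]:.4f}"   # unchanged by both
          f" euclid={np.linalg.norm(x - v):.4f}")  # changed by both
```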

## Information Based Measures

Information theory is a well-developed and fundamental discipline with broad applications

Information relates to possible outcomes of an event

E.g., the transmission of a message, the flip of a coin, or the measurement of a piece of data

The more certain an outcome, the less information that it contains and vice-versa

• For example, if a coin has two heads, then an outcome of heads provides no information

More quantitatively, the information in an outcome is inversely related to its probability: the self-information of an outcome with probability p is −log2 p bits.

• The smaller the probability of an outcome, the more information it provides, and vice-versa

## Entropy

is the commonly used measure for information

For a variable (event), X, with n possible values (outcomes), x1, x2, …, xn, each outcome having probability p1, p2, …, pn, the entropy of X, H(X), is given by

H(X) = − Σi pi log2 pi

where 0 ≤ H(X) ≤ log2 n, and H is measured in bits.

Thus, entropy is a measure of how many bits it takes to represent an observation of X on average. Not likely an integer!

### Examples

For a coin with probability p for heads and probability q = 1 – p for tails

H = - p log2 p - q log2 q

For a fair coin, p = 0.5, q = 0.5: H = 1 bit.
For a completely biased coin, p = 1 or q = 1: H = 0 (taking 0 log2 0 = 0).

A more realistic example

| Hair Color | Count | p | −p log2 p |
|---|---|---|---|
| Black | 75 | 0.75 | 0.3113 |
| Brown | 15 | 0.15 | 0.4105 |
| Blond | 5 | 0.05 | 0.2161 |
| Red | 0 | 0.00 | 0 |
| Other | 5 | 0.05 | 0.2161 |
| Total | 100 | 1.00 | 1.1540 |

Maximum entropy is log2 5 = 2.3219, that is, we need more than 2 bits on average for 5 equally likely values, but never more than 3.
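A short sketch reproducing the table's total from the counts:

```python
import math

def entropy(counts):
    """H = -sum(p * log2 p) over the categories with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

hair_counts = [75, 15, 5, 0, 5]   # Black, Brown, Blond, Red, Other
print(entropy(hair_counts))       # 1.1540...
print(math.log2(5))               # 2.3219 (the maximum for 5 values)
```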

In general, suppose we have m observations of some attribute, X, e.g., the hair color of students in the class, where there are n different possible values, and the number of observations in the i-th category is mi, so the probability of that category is mi / m.

Then, for this sample, the entropy is

H = − Σi (mi / m) log2 (mi / m)

For continuous attributes, the calculation is harder; it typically requires discretizing the values into bins or estimating a probability density first.

## Density

Measures the degree to which data objects are close to each other in a specified area

The notion of density is closely related to that of proximity

Concept of density is typically used for clustering and anomaly detection

Examples:

• Euclidean density = number of points per unit volume
• Probability density: Estimate what the distribution of the data looks like
• Graph-based density: Connectivity

### Euclidean Density: Grid-based Approach

The simplest approach is to divide the region into a number of rectangular cells of equal volume and define the density as the number of points each cell contains.

An alternative, center-based definition: Euclidean density at a point is the number of points within a specified radius of that point.
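A minimal sketch of both definitions with NumPy (the grid size, radius, and random sample are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(200, 2))   # 200 random 2-D points in [0, 10)^2

# Grid-based: divide the region into a 5 x 5 grid of equal cells
# and count the points falling in each cell.
counts, x_edges, y_edges = np.histogram2d(points[:, 0], points[:, 1],
                                          bins=5, range=[[0, 10], [0, 10]])
print(counts)          # each entry is the density (point count) of one cell

# Center-based: number of points within a given radius of a point.
center = np.array([5.0, 5.0])
radius = 1.5
print(np.sum(np.linalg.norm(points - center, axis=1) <= radius))
```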