last updated 23-Jun-2021
We consider similarity and dissimilarity in many places in data science.
Similarity might be used to identify
- duplicate data that may have differences due to typos.
- equivalent instances from different data sets, e.g., names and/or addresses that refer to the same entity but contain misspellings
- groups of data that are very close (clusters)
Dissimilarity might be used to identify
- interesting exceptions, e.g. credit card fraud
- boundaries to clusters
Proximity refers to either a similarity or a dissimilarity.
For a single attribute, the proximity computation depends on the attribute type:
- Nominal: dissimilarity is binary, 0 if the two values are equal and 1 if not
- Ordinal: dissimilarity is the difference between the two values' ranks, normalized by the maximum possible difference
- Quantitative: dissimilarity is just the distance between the two values; similarity rescales that distance to [0, 1]
Quantitative and ordinal attributes are naturally numeric, but nominal attributes must resort to the binary 0 or 1, match or not. A minimal sketch of these per-attribute measures follows.
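To make this concrete, here is a minimal Python sketch of these single-attribute proximities (the function names are ours, for illustration only, not from any particular library):

```python
def nominal_dissimilarity(a, b):
    """0 if the two nominal values match, 1 otherwise."""
    return 0 if a == b else 1

def ordinal_dissimilarity(rank_a, rank_b, n_levels):
    """Difference in rank, normalized by the maximum possible difference."""
    return abs(rank_a - rank_b) / (n_levels - 1)

def quantitative_dissimilarity(a, b):
    """Plain distance between two numeric values."""
    return abs(a - b)

def quantitative_similarity(a, b, value_range):
    """Rescale the distance into [0, 1]: 1 = identical, 0 = maximally far apart."""
    return 1 - abs(a - b) / value_range

# Example: ordinal quality scale {poor, fair, good, excellent} -> ranks 0..3
print(nominal_dissimilarity("red", "blue"))     # 1
print(ordinal_dissimilarity(1, 3, 4))           # 0.666...
print(quantitative_similarity(2.0, 5.0, 10.0))  # 0.7
```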
Euclidean distance: d(x, y) = ( Σk=1..n (xk - yk)^2 )^(1/2),
where n is the number of dimensions (attributes) and xk and yk are, respectively, the k-th attributes (components) of data objects x and y
Standardization/normalization may be necessary to ensure an attribute does not skew the distances due to different scales.
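For example, a small numpy sketch of z-score standardization before computing a Euclidean distance (the data values here are made up):

```python
import numpy as np

# Two attributes on very different scales: income (~1e4) and age (~1e1)
data = np.array([[50000.0, 25.0],
                 [52000.0, 60.0],
                 [90000.0, 30.0]])

# Without standardization, income dominates every distance.
# Z-score standardization puts each attribute on mean 0, std 1.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Euclidean distance between the first two (standardized) objects
d = np.sqrt(np.sum((standardized[0] - standardized[1]) ** 2))
print(d)
```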
Minkowski distance is a generalization of Euclidean distance: d(x, y) = ( Σk=1..n |xk - yk|^r )^(1/r),
where r is a parameter, n is the number of dimensions (attributes) and xk and yk are, respectively, the k-th attributes (components) of data objects x and y.
r = 1. "City block", "Manhattan", "taxicab", L1 norm distance.
r = 2. Euclidean distance (L2 norm)
r = ∞. “supremum” (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors
Do not confuse r with n, i.e., all these distance measures are defined for all numbers of dimensions.
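A minimal sketch of the Minkowski distance for the three common choices of r (using numpy; the sample points are arbitrary):

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance of order r; r=1 Manhattan, r=2 Euclidean."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if np.isinf(r):
        return np.max(np.abs(x - y))  # supremum (L-infinity) distance
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x, y = [0, 2], [3, 6]            # two 2-D points
print(minkowski(x, y, 1))        # 7.0 (city block)
print(minkowski(x, y, 2))        # 5.0 (Euclidean)
print(minkowski(x, y, np.inf))   # 4.0 (supremum)
```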
Mahalanobis distance: mahalanobis(x, y) = ( (x - y)^T Σ^-1 (x - y) )^(1/2), where Σ is the covariance matrix of the data.
Essentially this measures distance from the centroid of the cluster in a way that accounts for significant correlation among the attributes.
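A small numpy sketch, assuming the data matrix holds one object per row (the function name and sample data are ours):

```python
import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between x and y, using the covariance of `data`."""
    cov = np.cov(data, rowvar=False)  # covariance matrix of the attributes
    cov_inv = np.linalg.inv(cov)
    diff = np.asarray(x, float) - np.asarray(y, float)
    return np.sqrt(diff @ cov_inv @ diff)

# Hypothetical correlated 2-D data
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=500)
print(mahalanobis([1, 1], [0, 0], data))   # smaller: along the correlated direction
print(mahalanobis([1, -1], [0, 0], data))  # larger: across the correlated direction
```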
Distances, such as the Euclidean distance, have some well-known properties:
1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y
2. Symmetry: d(x, y) = d(y, x) for all x and y
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z
where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.
A distance that satisfies these properties is a metric.
Similarities also have some well-known properties:
1. s(x, y) = 1 only if x = y (assuming similarity is scaled to [0, 1])
2. Symmetry: s(x, y) = s(y, x) for all x and y
where s(x, y) is the similarity between points (data objects), x and y.
A common situation is that two objects, p and q, have only binary attributes. Example: A set of products bought or not.
Compute similarities using the following quantities (counts)
f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1
Simple Matching Coefficient: SMC measures similarity in a symmetric setting, i.e., it counts both 0-0 and 1-1 matches
- SMC = number of matches / number of attributes = (f11 + f00) / (f01 + f10 + f11 + f00)
Jaccard measures similarity in an asymmetric setting, i.e., it counts only the 1-1 matches and also drops the 0-0 matches from the denominator
- J = number of 1-1 matches / number of non-zero attributes = (f11) / (f01 + f10 + f11)
x = 1 0 1 1 0 1 0 0 0 1
y = 1 1 0 1 0 0 0 0 1 1
f01 = 2, f10 = 2, f00 = 3, f11 = 3
SMC = (3 + 3) / 10 = 0.6 and J = 3 / (2 + 2 + 3) = 3/7
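The same computation as a short Python sketch (using numpy; the function name is ours):

```python
import numpy as np

def smc_and_jaccard(x, y):
    """Simple Matching Coefficient and Jaccard for two binary vectors."""
    x, y = np.asarray(x), np.asarray(y)
    f11 = np.sum((x == 1) & (y == 1))
    f00 = np.sum((x == 0) & (y == 0))
    f10 = np.sum((x == 1) & (y == 0))
    f01 = np.sum((x == 0) & (y == 1))
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jaccard = f11 / (f01 + f10 + f11)
    return smc, jaccard

x = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
y = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
print(smc_and_jaccard(x, y))  # (0.6, 0.42857...)
```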
Cosine similarity: when you have a set of quantified attributes for each instance (e.g., term counts for documents), it is an alternative to Minkowski distances.
If d1 and d2 are two document vectors, then
cos( d1, d2 ) = <d1,d2> / ||d1|| ||d2|| ,
where <d1,d2> indicates inner product or vector dot product d1'd2 of vectors d1 and d2, and || d || is the length of vector d.
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^(1/2) = (42)^(1/2) = 6.481
|| d2 || = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^(1/2) = (6)^(1/2) = 2.449
cos(d1, d2 ) = 0.3150
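The same example as a short numpy sketch (the function name is ours):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: <a,b> / (||a|| ||b||)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(cosine_similarity(d1, d2))  # ~0.3150
```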
Extended Jaccard (Tanimoto) coefficient: a variation of Jaccard for continuous or count attributes,
EJ(x, y) = <x, y> / ( ||x||^2 + ||y||^2 - <x, y> )
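As a one-function sketch (the function name is ours; the sample vectors are arbitrary):

```python
import numpy as np

def extended_jaccard(x, y):
    """Tanimoto / extended Jaccard: <x,y> / (||x||^2 + ||y||^2 - <x,y>)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dot = x @ y
    return dot / (x @ x + y @ y - dot)

print(extended_jaccard([3, 2, 0, 5], [1, 0, 0, 0]))  # 3 / (38 + 1 - 3) = 0.0833...
```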
Correlation measures the linear relationship between two data objects: corr(x, y) = covariance(x, y) / ( σ(x) σ(y) )
x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
In this case, notice that yi = xi^2, so there's clearly a functional relationship between x and y
μ(x) = 0, μ(y) = 4
σ(x) = 2.16, σ(y) = 3.74
But the correlation = [ (-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5) ] / ( 6 * 2.16 * 3.74 ) = 0
Correlation captures only linear relationships, so it can be 0 even when, as here, y is completely determined by x
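The same result reproduced with numpy's built-in Pearson correlation:

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2  # perfect (nonlinear) functional relationship

# Pearson correlation is 0 despite y being fully determined by x
print(np.corrcoef(x, y)[0, 1])  # 0.0
```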
How to choose among the proximity measures?
Domain of application often drives choice
However, one can talk about various properties that you would like a proximity measure to have
The measure must be applicable to the data and produce results that agree with domain knowledge.
|Property|Cosine|Correlation|Euclidean distance (r = 2)|
|---|---|---|---|
|Invariant to scaling (multiply)|Yes|Yes|No|
|Invariant to translation (add)|No|Yes|No|

For example, with x = (1, 2, 4, 3, 0, 0, 0) and y = (1, 2, 3, 4, 0, 0, 0):

|Comparison|Cosine|Correlation|Euclidean distance (r = 2)|
|---|---|---|---|
|x and y|0.9667|0.9429|1.4142|
|x and s = (2, 4, 6, 8, 0, 0, 0), y scaled by 2|0.9667|0.9429|5.831|
|x and t = (6, 7, 8, 9, 5, 5, 5), y translated by 5|0.794|0.9429|14.2127|

Note that cosine is unchanged by scaling but not by translation, correlation is unchanged by both, and Euclidean distance by neither.
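A quick numpy check of these invariance properties (the helper names are ours; exact printed values depend on the precise vectors used):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation(a, b):
    return np.corrcoef(a, b)[0, 1]

x = np.array([1, 2, 4, 3, 0, 0, 0], dtype=float)
y = np.array([1, 2, 3, 4, 0, 0, 0], dtype=float)

for label, y2 in [("y", y), ("2*y (scaled)", 2 * y), ("y+5 (translated)", y + 5)]:
    d = np.linalg.norm(x - y2)  # Euclidean distance
    print(label, cosine(x, y2), correlation(x, y2), d)
# Cosine stays the same under scaling but changes under translation;
# correlation stays the same under both; Euclidean distance changes under both.
```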
Information theory is a well-developed and fundamental discipline with broad applications
Information relates to possible outcomes of an event
E.g., transmission of a message, flip of a coin, or measurement of a piece of data
The more certain an outcome, the less information it contains, and vice versa
More quantitatively, the information is inversely related to the probability of an outcome
Entropy is the commonly used measure for information
For a variable (event), X, with n possible values (outcomes), x1, x2 …, xn
and each outcome having probability, p1, p2 …, pn
the entropy of X, H(X), is given by
H(X) = - Σi=1..n pi log2 pi
0 ≤ H(X) ≤ log2 n, and entropy is measured in bits.
Thus, entropy is a measure of how many bits it takes to represent an observation of X on average. Not likely an integer!
For a coin with probability p for heads and probability q = 1 – p for tails
H = - p log2 p - q log2 q
For a fair coin, p = q = 0.5, so H = 1
For a fully weighted coin, p = 1 or q = 1, so H = 0 (using the convention 0 log2 0 = 0)
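These coin entropies as a short Python sketch (the function name is ours):

```python
from math import log2

def coin_entropy(p):
    """H = -p log2 p - (1-p) log2 (1-p), with the convention 0 log2 0 = 0."""
    q = 1 - p
    return -sum(v * log2(v) for v in (p, q) if v > 0)

print(coin_entropy(0.5))  # 1.0 bit (fair coin)
print(coin_entropy(1.0))  # 0.0 bits (certain outcome)
print(coin_entropy(0.9))  # ~0.469 bits (biased coin)
```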
A more realistic example: the distribution of hair colors among students in a class, tabulating each color's probability p and its contribution -p log2 p, then summing the contributions
Maximum entropy is log2 5 = 2.3219; that is, we may need more than 2 bits for 5 unique values, but never more than 3
In general, suppose we have a number of observations (m) of some attribute, X, e.g., the hair color of students in the class,
where there are n different possible values.
And the number of observations in the i-th category is mi, thus the estimated probability is pi = mi / m
Then, for this sample, the entropy is H = - Σi=1..n (mi / m) log2 (mi / m)
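A short Python sketch of this sample-entropy calculation; the hair-color counts here are made up for illustration:

```python
import numpy as np

def sample_entropy(counts):
    """H = -sum((mi/m) * log2(mi/m)) over the categories with mi > 0."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]  # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

# Hypothetical hair-color counts for 5 categories in a class of 100
counts = [75, 15, 5, 0, 5]
print(sample_entropy(counts))           # ~1.15 bits, well under log2(5) = 2.32
print(sample_entropy([1, 1, 1, 1, 1]))  # uniform -> maximum entropy, 2.3219
```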
For continuous data, the calculation is harder; the values typically must be discretized into bins, or a density estimate used, before entropy can be computed.
Density measures the degree to which data objects are close to each other in a specified area
The notion of density is closely related to that of proximity
Concept of density is typically used for clustering and anomaly detection
The simplest approach is to divide the region into a number of rectangular cells of equal volume,
then define density as the number of points each cell contains.
Euclidean density is the number of points within a specified radius of the point
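A minimal numpy sketch of both notions, cell-based and center-based (Euclidean) density; the function names and sample data are ours:

```python
import numpy as np

def grid_density(points, cell_size):
    """Assign each point to a rectangular cell and count points per cell."""
    cells = np.floor(np.asarray(points) / cell_size).astype(int)
    unique, counts = np.unique(cells, axis=0, return_counts=True)
    return dict(zip(map(tuple, unique), counts))

def euclidean_density(points, center, radius):
    """Number of points within the given radius of a center point."""
    dists = np.linalg.norm(np.asarray(points) - np.asarray(center), axis=1)
    return int(np.sum(dists <= radius))

rng = np.random.default_rng(1)
points = rng.normal(0, 1, size=(200, 2))  # synthetic 2-D data
print(euclidean_density(points, center=[0, 0], radius=0.5))
print(grid_density(points, cell_size=1.0).get((0, 0), 0))  # count in cell [0,1) x [0,1)
```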