Cluster Analysis


last updated 12-Aug-2021

Introduction

Cluster analysis is another unsupervised data mining technique: no target class variable is identified.

Cluster Analysis: Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Each cluster is usually characterized using the means, medians, or modes of the attributes of the instances in the cluster.

Applications

Understanding: Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Summarization: reduce the size of a large dataset

Cluster analysis is NOT

Simple segmentation: Dividing students into different registration groups alphabetically, by last name

Results of a query: Groupings are a result of an external specification. Clustering, instead, is a grouping of objects based on the data.

Supervised classification: Have class label information

Association Analysis: Local vs. global connections

The notion of a cluster can be ambiguous: how many clusters are there, and how similar must the elements within a cluster be?


Clustering

A clustering is a set of clusters.

Partitional Clustering:

A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

Hierarchical clustering:

A set of nested clusters organized as a hierarchical tree


Other distinctions between sets of clusters

Exclusive versus non-exclusive (in a non-exclusive clustering, a point may belong to more than one cluster)

Fuzzy versus non-fuzzy (in fuzzy clustering, a point belongs to every cluster with a weight between 0 and 1)

Partial versus complete (a partial clustering groups only some of the data)

Heterogeneous versus homogeneous (clusters may be of widely different sizes, shapes, and densities)


Types of clusters

Well-separated clusters: A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

Center-based clusters: A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster; the center is often a centroid (mean) or a medoid (most representative point).

Contiguous clusters (Nearest neighbor or Transitive)

A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Density-based clusters: A cluster is a dense region of points separated from other high-density regions by regions of low density.

Shared Property or Conceptual clusters

clusters that share some common property or represent a particular concept

 

Described by an Objective Function

Finds clusters that minimize or maximize an objective function.

In principle, one could enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function, but this is computationally intractable (NP-hard).

Can have global or local objectives.

A variation of the global objective function approach is to fit the data to a parameterized model.

Examples:

Sum of Squared Error (SSE) using Euclidean distances

Document data: represented as a document-term matrix and clustered using the cosine similarity measure.

 


Characteristics of the input data are important

Type of proximity or density measure

Data characteristics that affect proximity and/or density, such as dimensionality, sparseness, and attribute types

Noise and Outliers


Algorithms

K-means and its variants

Hierarchical clustering

Density-based clustering


K-means clustering

Partitional clustering approach

Number of clusters, K, must be specified

Each cluster is associated with a centroid (center point)

Each point is assigned to the cluster with the closest centroid


K-means details

Initial centroids are often chosen randomly.

The centroid is (typically) the mean of the points in the cluster.

‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.

K-means will converge for common similarity measures mentioned above.

Most of the convergence happens in the first few iterations.

Complexity is O( n * K * I * d ), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes. (A code sketch follows this list.)

Clustering Animations at https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
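As a rough illustration, here is a minimal NumPy sketch of the basic algorithm; the function name, the random initialization, and the convergence test are illustrative choices, not a prescribed implementation.

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Minimal K-means sketch: X is an (n, d) array of points, k the number of clusters."""
        rng = np.random.default_rng(seed)
        # choose k distinct points at random as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # assign each point to the cluster with the closest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each centroid as the mean of the points assigned to it
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):  # stop once the centroids no longer move
                break
            centroids = new_centroids
        return centroids, labels

Each iteration is dominated by the n-by-K distance computation over d attributes, which is where the O( n * K * I * d ) complexity above comes from.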


Evaluating K-means clusters

Most common measure is Sum of Squared Error (SSE)

For each point, the error is the distance to the centroid of its cluster.

To get SSE, we square these errors and sum them over all points:

SSE = sum over clusters i = 1..K of sum over points x in Ci of dist(mi, x)^2

where x is a data point in cluster Ci and mi is the representative point (centroid) for cluster Ci. (A code sketch follows this list.)

Given two sets of clusters, we prefer the one with the smallest error

One easy way to reduce SSE is to increase K, the number of clusters
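As a quick sketch (the variable names are illustrative), SSE can be computed directly from the points, their cluster labels, and the centroids:

    import numpy as np

    def sse(X, labels, centroids):
        """Sum of squared Euclidean distances from each point to its own cluster centroid."""
        diffs = X - centroids[labels]   # vector from each point to the centroid of its cluster
        return float(np.sum(diffs ** 2))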


Evaluating the choice of k in k-means

Essentially, run the k-means algorithm with increasing k, graph the SSE against k, and look for the point where SSE stops decreasing significantly (the "elbow bend"); choose your k there. Of course, this is somewhat subjective. (A code sketch follows the example graphs below.)

Here are two sample graphs that reflect the idea.  Dataset A has a clear k of 3.  Dataset B is not so clear: 2 or 3 or 4? or 6 or 7?  Possibly 4 in this case.

[Image taken from URL: https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb]
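A minimal sketch of the elbow procedure, assuming scikit-learn and matplotlib are available; inertia_ is scikit-learn's name for the SSE of a fitted clustering, and the data here is just a placeholder:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.rand(300, 2)                     # placeholder data; substitute your own dataset

    ks = range(1, 11)
    sse_values = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse_values.append(km.inertia_)             # SSE of the clustering found for this k

    plt.plot(ks, sse_values, marker='o')           # look for the "elbow bend" in this curve
    plt.xlabel('k')
    plt.ylabel('SSE')
    plt.show()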

 


Limitations

K-means has problems when clusters differ in size, density, or shape (non-globular clusters).

K-means also has problems when the data contains outliers.


Overcoming limitations

One common workaround is to use a larger number of clusters, so that each natural cluster is split into several K-means clusters, and then merge the pieces in a post-processing step.

Importance of Choosing Initial Centroids...

Good initial centroids let K-means recover the natural clusters; not-so-good ones can leave it stuck in a poor local minimum.


Problems with selecting initial points

If there are K 'real' clusters, then the chance of randomly selecting one initial centroid from each cluster is small, especially when K is large.

10 clusters example

 

In that example, some pairs of clusters start with three initial centroids while other pairs have only one, so the final clustering splits some natural clusters and fails to separate others.
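Under the simplifying assumption that each of the K real clusters contains the same number of points, n, the probability that K randomly chosen initial centroids include exactly one point from each real cluster is

P = (K! * n^K) / (K*n)^K = K! / K^K

For the 10-cluster example, P = 10! / 10^10, which is about 0.00036, so a single random initialization almost never seeds every real cluster.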

Some solutions to initial choices

Multiple runs: helps, but the probability of a good start is still not on your side (see the sketch after this list)

Sample and use hierarchical clustering to determine initial centroids

Select more than k initial centroids and then select among these initial centroids

Postprocessing

Generate a larger number of clusters and then perform a hierarchical clustering
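As a small sketch of the multiple-runs idea (assuming scikit-learn and placeholder data), run K-means from several random seeds and keep the run with the lowest SSE; scikit-learn's n_init parameter performs this same strategy internally:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)                     # placeholder data

    # run K-means from 10 different random seeds and keep the clustering with the lowest SSE
    best = min(
        (KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X) for seed in range(10)),
        key=lambda km: km.inertia_,
    )
    print(best.inertia_)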

Bisecting K-means

Bisecting K-means algorithm

Variant of K-means that can produce a partitional or a hierarchical clustering: start with all points in a single cluster and repeatedly split one cluster (for example, the one with the largest SSE) using 2-means until K clusters are obtained. (A code sketch follows below.)
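A rough sketch of bisecting K-means, assuming scikit-learn; choosing the cluster with the largest SSE as the next one to split is one common choice, not the only option:

    import numpy as np
    from sklearn.cluster import KMeans

    def bisecting_kmeans(X, k):
        """Repeatedly split the cluster with the largest SSE using ordinary 2-means."""
        clusters = [X]                              # start with all points in one cluster
        while len(clusters) < k:
            # pick the cluster with the largest SSE to split next
            sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
            target = clusters.pop(int(np.argmax(sses)))
            # split it into two pieces with 2-means
            km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(target)
            clusters.append(target[km.labels_ == 0])
            clusters.append(target[km.labels_ == 1])
        return clusters

Keeping the sequence of splits also yields a hierarchy, which is why the same procedure can produce either a partitional or a hierarchical clustering.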

K-means++

This approach can be slower than random initialization, but very consistently produces better results in terms of SSE

The k-means++ algorithm guarantees an approximation ratio of O(log k) in expectation, where k is the number of centers

To select a set of initial centroids, C, perform the following:

1. Select an initial point at random to be the first centroid.

2. For each of the remaining k - 1 centroids: for every point x, compute d(x), the distance to the closest already-chosen centroid, and select the next centroid at random with probability proportional to d(x)^2. (A code sketch follows below.)
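A minimal NumPy sketch of that seeding procedure (the function name and details are illustrative):

    import numpy as np

    def kmeans_pp_init(X, k, seed=0):
        """k-means++ seeding: spread the initial centroids apart in expectation."""
        rng = np.random.default_rng(seed)
        centroids = [X[rng.integers(len(X))]]       # first centroid chosen uniformly at random
        for _ in range(k - 1):
            # squared distance from every point to its nearest already-chosen centroid
            d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            # choose the next centroid with probability proportional to d^2
            centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centroids)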


Additional Issues

Empty Clusters

K-means can yield empty clusters.

Several strategies can resolve the problem, e.g., replace the empty cluster's centroid with the point farthest from any current centroid, or with a point chosen from the cluster with the highest SSE.

Updating Centroids Incrementally

In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid

An alternative is to update the centroids after each assignment (incremental approach)
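The incremental version relies on the running-mean update: if a centroid m was computed from n points and a new point x joins the cluster, the new centroid is m + (x - m) / (n + 1). A tiny sketch:

    import numpy as np

    def add_point(centroid, n, x):
        """Running-mean update: fold one new point x into a centroid built from n points."""
        return centroid + (x - centroid) / (n + 1), n + 1

    c, n = np.array([0.0, 0.0]), 0
    for p in [np.array([1.0, 1.0]), np.array([3.0, 1.0])]:
        c, n = add_point(c, n, p)
    # c is now [2.0, 1.0], the mean of the two points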

Pre-processing

Normalize the data

Eliminate outliers

Post-processing

Eliminate small clusters that may represent outliers

Split ‘loose’ clusters, i.e., clusters with relatively high SSE

Merge clusters that are ‘close’ and that have relatively low SSE

Can use these steps during the clustering process (ISODATA)