Hierarchical Clustering


last updated 23-Jun-2021

Introduction

Produces a set of nested clusters organized as a hierarchical tree

Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits (a small plotting sketch appears at the end of this introduction)

Advantages:

Do not have to assume any particular number of clusters in advance; any desired number can be obtained by cutting the dendrogram at the proper level

The resulting hierarchies may correspond to meaningful taxonomies (e.g., the animal kingdom or phylogenetic trees in the biological sciences)

Two main types of hierarchical clustering

Agglomerative: start with each point as its own cluster and, at each step, merge the closest pair of clusters until only one cluster (or k clusters) remains

Divisive: start with one all-inclusive cluster and, at each step, split a cluster until each cluster contains a single point (or there are k clusters)

Traditional hierarchical algorithms use a similarity or distance matrix
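
To make the dendrogram idea concrete, here is a minimal sketch (not part of the original notes) that builds a hierarchical clustering of a small random data set and draws its dendrogram, assuming SciPy and matplotlib are available:

```python
# Minimal sketch: compute an agglomerative clustering and draw its dendrogram.
# Assumes numpy, scipy, and matplotlib are installed; the data are toy values.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))       # 20 random 2-D points

Z = linkage(X, method="average")   # merge history: one row per merge
dendrogram(Z)                      # the nested clusters drawn as a tree
plt.show()
```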


Agglomerative Clustering Algorithm

Most popular hierarchical clustering technique

The basic algorithm is straightforward:

  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
    1. Merge the two closest clusters
    2. Update the proximity matrix
  4. Until only a single cluster remains

Key operation is the computation of the proximity of two clusters.

Different approaches to defining the distance between clusters distinguish the different algorithms
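
As a rough illustration of the steps above (hypothetical code, not part of the original notes), the sketch below runs a naive agglomerative loop with single-link proximity. Instead of updating a cluster-level proximity matrix in place, it recomputes cluster-to-cluster distances from the point-level matrix on every pass, which keeps the code short at the cost of speed:

```python
# Naive agglomerative clustering sketch (single-link proximity assumed).
# Mirrors the steps above: build the proximity matrix, start with singleton
# clusters, then repeatedly merge the two closest clusters.
import numpy as np

def agglomerate(X):
    n = len(X)
    # 1. proximity matrix of pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # 2. each data point starts as its own cluster
    clusters = [[i] for i in range(n)]
    merges = []
    # 3. repeat until only a single cluster remains
    while len(clusters) > 1:
        best = (np.inf, 0, 0)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance of the closest pair across clusters
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into a
        del clusters[b]
    return merges

if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 5.0]])
    for left, right, dist in agglomerate(X):
        print(left, "+", right, "at distance", round(dist, 2))
```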

Starting point: every point is its own cluster, and the proximity matrix holds the distance between each pair of points.

Intermediate situation: after some merging we are left with a handful of clusters (say C1 through C5). We now want to merge the two closest clusters, C2 and C5, and then update the proximity matrix.

How should the proximity matrix be updated? That depends on how inter-cluster proximity is defined.

Possible definitions of inter-cluster similarity (see the code sketch after this list):

MIN

MAX

Group Average

 

Distance Between Centroids

Other methods driven by an objective function: e.g., Ward’s Method uses squared error
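
These definitions correspond to the method argument of SciPy's linkage function; a hedged sketch, assuming SciPy is available:

```python
# Sketch: the inter-cluster similarity definitions above map onto the
# `method` argument of scipy.cluster.hierarchy.linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))       # small toy data set

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)  # merge history under this definition
    print(method, "-> height of final merge:", round(float(Z[-1, 2]), 3))
```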


MIN or Single Link

Proximity of two clusters is based on the two closest points in the different clusters; it is therefore determined by a single pair of points, i.e., by one link in the proximity graph.

Strength of MIN: handles non-elliptical (non-globular) cluster shapes well, since clusters grow through chains of mutually close points.

Limitation of MIN: sensitive to noise and outliers, which can form bridges that merge otherwise distinct clusters (the chaining effect).
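
A rough demonstration of the strength noted above (not part of the original notes): single link can recover two interleaved half-moon shapes that globular-cluster methods typically split incorrectly. The sketch below assumes scikit-learn is available:

```python
# Sketch: single-link agglomerative clustering on a non-elliptical
# (two-moons) data set, using scikit-learn.
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# linkage="single" merges clusters by the distance of their closest points
labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# With this low noise level the two recovered clusters usually match the moons.
print("points per cluster:", (labels == 0).sum(), (labels == 1).sum())
```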

 


MAX or Complete Linkage

Proximity of two clusters is based on the two farthest (least similar) points in the different clusters; it is again determined by a single pair of points.

Strength: less susceptible to noise and outliers

Limitations of MAX: tends to break large clusters and is biased towards globular clusters

Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Need to use the average rather than the total pairwise proximity, since total proximity favors large clusters
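
Written out (notation mine, not from the original notes), the group-average proximity is the total pairwise proximity divided by the number of pairs:

```latex
\mathrm{proximity}(C_i, C_j)
  = \frac{\sum_{p \in C_i} \sum_{q \in C_j} \mathrm{proximity}(p, q)}
         {\lvert C_i \rvert \cdot \lvert C_j \rvert}
```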


Group average is a compromise between single link (MIN) and complete link (MAX)

Strengths: Less susceptible to noise and outliers

Limitations: Biased towards globular clusters


Ward's Method

Similarity of two clusters is based on the increase in squared error when two clusters are merged

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analog of K-means
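
Since Ward's method and K-means both minimize the within-cluster sum of squared errors, their partitions of well-separated data tend to agree closely. A comparison sketch (not part of the original notes), assuming scikit-learn is available:

```python
# Sketch: Ward linkage and K-means both minimize within-cluster squared error,
# so on well-separated blobs their labelings usually agree almost perfectly.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ward_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index near 1.0 means the two partitions largely agree.
print(adjusted_rand_score(ward_labels, km_labels))
```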

 


Performance of hierarchical clustering

O(N²) space, since it uses the proximity matrix, where N is the number of points.

O(N³) time in many cases: there are N merge steps, and at each step the N² proximity matrix must be searched and updated. This can be reduced to about O(N² log N) for some approaches.

Once a decision is made to combine two clusters, it cannot be undone.

No global objective function is directly minimized

Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes or non-globular shapes; a tendency to break large clusters.