DBSCAN and Misc. clustering topics


last updated 23-Jun-2021

Introduction: Density-Based Clustering Algorithms

DBSCAN is a density-based algorithm. Instances are part of a cluster based on density.

Density = the number of points within a specified radius (Eps, epsilon); a density threshold is set by a minimum number of points (MinPts).

A point is a core point if it has at least a specified number of points (MinPts) within Eps.

A border point is not a core point but is in the neighborhood of a core point; it has fewer than MinPts points within Eps, so the cluster cannot grow further through it.

A noise point is any point that is neither a core point nor a border point.
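The three definitions above can be sketched directly in code. This is a minimal illustration using brute-force neighborhood computation (not an efficient implementation); the function name and example data are mine, and the Eps-neighborhood of a point is taken to include the point itself, which is the usual convention.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (brute-force neighborhoods)."""
    n = len(X)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # A point's Eps-neighborhood includes itself, matching the usual convention
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")   # not core, but a core point is within Eps
        else:
            labels.append("noise")    # neither core nor near a core point
    return labels

# Hypothetical data: one dense square of four points plus one far-away point
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], float)
print(classify_points(X, eps=1.5, min_pts=4))
# → ['core', 'core', 'core', 'core', 'noise']
```

With eps=1.5 every point in the square has all four square points in its neighborhood (so each is core), while the isolated point sees only itself.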

The example in the figure at right uses MinPts = 7.

Below is an example with MinPts = 4 showing how a cluster grows transitively: points reachable within the Eps range are added, and those points can in turn be used to reach farther points. [Taken from https://en.wikipedia.org/wiki/DBSCAN]

The result is shown below: the densely clustered instances (core points) are in green, the border or edge points in blue, and noise in red.

And the clusters are separable:


DBSCAN

The DBSCAN algorithm can be abstracted into the following steps: [Wikipedia]

  1. Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors.
  2. Find the connected components of core points on the neighbor graph, ignoring all non-core points.
  3. Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise.
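The three steps above can be sketched as a compact implementation. This is a minimal, brute-force version for illustration (the example data and eps/min_pts values are mine); it grows each cluster to completion by expanding only through core points, so border points join a cluster but never extend it.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns cluster labels 0, 1, ... with -1 for noise."""
    n = len(X)
    # Step 1: neighborhoods and core points (brute force)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)   # -1 means unassigned / noise
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Step 2: grow a new cluster from this unlabeled core point
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster       # step 3: neighbor joins the cluster
                    if is_core[q]:
                        frontier.append(q)    # only core points extend the cluster
        cluster += 1
    return labels

# Hypothetical data: a dense square, and a sparse pair too small to be core
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11]], float)
print(dbscan(X, eps=1.5, min_pts=3))
# → [ 0  0  0  0 -1 -1]
```

Points that are never reached from a core point keep the label -1 and are reported as noise, matching step 3.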

A naive implementation of the above requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm avoids this by performing these steps for one point at a time, as outlined below:

Determine/label core points: those which have at least MinPts neighbors within range of Eps.

Determine/label border points: those that count as a neighbor of a core point but do not themselves fulfill MinPts.

Eliminate noise points (those that are neither core nor border).

Perform clustering on the core points:

The algorithm needs to merge cluster labels in the transitive-connection case. Whether a point can already carry a different cluster label depends on the implementation: if each cluster is grown to completion before the next one starts, a reached core point can never have a different label; but if provisional labels are assigned in a single scan over the points, expansion can reach a core point labeled earlier with a different label, and the two labels must then be merged because they name the same cluster.
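One standard way to handle the merge case is a union-find (disjoint-set) structure over the core points: each edge joins two core points within Eps of each other, and merging two provisional labels is a single union operation. This sketch (the function name and example edges are mine) computes the connected components of the core-point graph, which is exactly step 2 of the abstract algorithm:

```python
def connected_core_clusters(core_ids, core_edges):
    """Union-find over core points; merges provisional labels when two
    partially grown clusters turn out to be transitively connected."""
    parent = {i: i for i in core_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two provisional clusters

    for a, b in core_edges:  # edge = two core points within Eps of each other
        union(a, b)
    # Canonical cluster id for each core point
    return {i: find(i) for i in core_ids}

# Hypothetical core points: chain 0-1-2 and pair 3-4 give two clusters
print(connected_core_clusters([0, 1, 2, 3, 4], [(0, 1), (1, 2), (3, 4)]))
```

Points 0, 1, 2 end up with one shared canonical id and points 3, 4 with another, regardless of the order in which the edges were processed.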

Advantages: resistant to noise; can find clusters of arbitrary shapes and sizes; the number of clusters does not need to be specified in advance.

Disadvantages: struggles when clusters have widely varying densities (a single Eps cannot fit all of them); sensitive to the choice of Eps and MinPts; density becomes hard to define for high-dimensional data.


Determining Eps and MinPts

MinPts: As a rule of thumb, a minimum minPts can be derived from the number of dimensions D in the data set, as minPts ≥ D + 1. minPts must be chosen to be at least 3. However, larger values are usually better for data sets with noise and will yield more significant clusters. As a rule of thumb, minPts = 2·D can be used, but it may be necessary to choose larger values for very large data, for noisy data, or for data that contains many duplicates. [Wikipedia]

ε: The value for ε can then be chosen by using a k-distance graph: plot the distance to the k = minPts − 1 nearest neighbor of every point, ordered from the largest to the smallest value. Good values of ε are where this plot shows an "elbow": if ε is chosen much too small, a large part of the data will not be clustered, whereas for a too-high value of ε, clusters will merge and the majority of objects will be in the same cluster. In general, small values of ε are preferable, and as a rule of thumb only a small fraction of points should be within this distance of each other. [Wikipedia]

The assumption is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance.

Noise points have their k-th nearest neighbors at a farther distance.

So, plot the sorted distance of every point to its k-th nearest neighbor.
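The sorted k-distance values described above are easy to compute directly. A brief sketch (the function name and synthetic data are mine); in practice you would plot the returned array and look for the elbow, which separates the cluster points (flat region, small distances) from the noise points (steep region, large distances):

```python
import numpy as np

def k_distances(X, k):
    """Distance of every point to its k-th nearest neighbor,
    sorted from largest to smallest (as in a k-distance graph)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dists.sort(axis=1)   # column 0 is each point's distance to itself (0)
    kth = dists[:, k]    # k-th nearest neighbor, excluding the point itself
    return np.sort(kth)[::-1]

# Synthetic example: one dense cluster plus a few scattered noise points
rng = np.random.default_rng(0)
cluster = rng.normal(0, 0.5, size=(50, 2))
noise = rng.uniform(-10, 10, size=(5, 2))
X = np.vstack([cluster, noise])

d = k_distances(X, k=3)
print(d[:5])   # the largest k-distances belong to the noise points
```

Plotting `d` against point rank makes the elbow visible; a reasonable ε is the k-distance value at that elbow.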


DBSCAN Animation at https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/