DM 352 Syllabus | DM 552 Syllabus
last updated 23-Jun-2021
DBSCAN is a density-based algorithm: instances are assigned to a cluster based on density.
Density is measured as the number of points within a specified radius (Eps, epsilon), together with a density threshold given by a minimum number of points (MinPts).
A point is a core point if it has at least a specified number of points (MinPts) within distance Eps of it.
A border point is not a core point, but is in the Eps-neighborhood of a core point; it has fewer than MinPts points within Eps itself, so the cluster cannot grow any further through it.
A noise point is any point that is neither a core point nor a border point.
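As a sketch (not from the notes), the three definitions above can be checked directly with a brute-force neighborhood scan; the point set and the choices of Eps and MinPts below are purely illustrative:

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the definitions above."""
    n = len(points)
    # Eps-neighborhood of each point (including the point itself)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = []
    for i in range(n):
        if len(neighbors[i]) >= min_pts:
            labels.append("core")
        elif any(len(neighbors[j]) >= min_pts for j in neighbors[i]):
            labels.append("border")   # not core, but in a core point's neighborhood
        else:
            labels.append("noise")
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
print(classify_points(pts, eps=1.5, min_pts=4))
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here (2, 1) has only 3 points within Eps, so it is not core, but it neighbors the core point (1, 1), making it a border point; (8, 8) is isolated and therefore noise.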
The example on the right uses MinPts = 7.
Below is an example with MinPts = 4 showing how a cluster grows transitively: points reachable within the Eps range are added, and each added core point can in turn be used to reach farther points. [Taken from https://en.wikipedia.org/wiki/DBSCAN]
The result is shown below: the densely clustered instances (core points) are in green, the border or edge points are in blue, and noise points are in red.
And the clusters are separable:
The DBSCAN algorithm can be abstracted into the following steps: [Wikipedia]
1. Find the points in the Eps-neighborhood of every point, and identify the core points, i.e. those with at least MinPts neighbors.
2. Find the connected components of core points on the neighbor graph, ignoring all non-core points.
3. Assign each non-core point to a nearby cluster if the cluster contains an Eps-neighbor of it; otherwise, mark it as noise.
A naive implementation of the above requires storing the neighborhoods found in step 1, and thus substantial memory. The original DBSCAN algorithm avoids this by performing these steps for one point at a time, as outlined below:
Determine/label core points... those that have at least MinPts neighbors within range of Eps
Determine/label border points... those that are in a core point's neighborhood but do not themselves have MinPts neighbors
Eliminate noise points
Perform clustering on the core points:
The algorithm needs to merge cluster labels in the transitive-connection case. What if a point already has a cluster label different from the current one? ... or is that not possible?
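The point-at-a-time algorithm can be sketched as below (a minimal, unoptimized Python version modeled on the standard seed-set expansion; names like `region_query` are my own, not from the notes). Note that with seed-set expansion the label-merging question does not arise: each cluster is grown completely, through all transitive connections, before the scan moves on to the next unvisited point.

```python
import math

UNVISITED, NOISE = None, -1

def region_query(points, i, eps):
    """Indices of all points within Eps of point i (including i itself)."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may be relabeled later as a border point
            continue
        labels[i] = cluster                # i is a core point: start a new cluster
        seeds = list(neighbors)
        while seeds:                       # grow the cluster transitively
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # noise reachable from a core point = border
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:    # j is itself core: expand through it
                seeds.extend(j_neighbors)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
       (8, 8), (8, 9), (9, 8), (9, 9), (8.5, 8.5), (20, 20)]
print(dbscan(pts, eps=1.5, min_pts=4))
# → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, -1]
```

Two clusters emerge (labels 0 and 1), and the isolated point (20, 20) is labeled -1 for noise.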
MinPts: As a rule of thumb, a minimum minPts can be derived from the number of dimensions D in the data set, as minPts ≥ D + 1. minPts must be at least 3. However, larger values are usually better for data sets with noise and will yield more significant clusters. As a rule of thumb, minPts = 2·dim can be used, but it may be necessary to choose larger values for very large data, for noisy data, or for data that contains many duplicates. [Wikipedia]
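As a small illustration of the rule of thumb above (the function name is mine, not from the notes):

```python
def min_pts_rule_of_thumb(dim):
    """Starting value for minPts: 2·dim, but never below the minimum of 3."""
    return max(3, 2 * dim)

print(min_pts_rule_of_thumb(2))  # 2-D data → 4
print(min_pts_rule_of_thumb(1))  # clamped to the minimum of 3
```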
ε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to the k = minPts-1 nearest neighbor ordered from the largest to the smallest value. Good values of ε are where this plot shows an "elbow": if ε is chosen much too small, a large part of the data will not be clustered; whereas for a too high value of ε, clusters will merge and the majority of objects will be in the same cluster. In general, small values of ε are preferable, and as a rule of thumb only a small fraction of points should be within this distance of each other.
The assumption is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance.
Noise points have their k-th nearest neighbors at a farther distance.
So, plot the sorted distance of every point to its k-th nearest neighbor.
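That k-distance computation can be sketched as follows (brute force; `k_distances` is an assumed helper name). The values it returns are what would be plotted to look for the elbow:

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points if q is not p)
        result.append(d[k - 1])
    return sorted(result, reverse=True)

# With minPts = 4, look at each point's k = minPts - 1 = 3rd nearest neighbor
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(k_distances(pts, k=3))
```

In this toy example the four clustered points all have a 3rd-nearest-neighbor distance of about 1.41, while the outlier's is over 10; the sharp drop between them (the "elbow") suggests an Eps somewhere above 1.41.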
DBSCAN Animation at https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/