DM 352 Syllabus | DM 552 Syllabus
last updated 23-Jun-2021
Instance-Based Classifiers -- delay model construction until a test observation is ready to be classified.
If it walks like a duck, quacks like a duck, then it’s probably a duck
Requires three things: the set of stored training records, a distance metric to compute the distance between records, and the value of k (the number of nearest neighbors to retrieve)
To classify an unknown record: compute its distance to the training records, identify its k nearest neighbors, and use the neighbors' class labels to determine the class of the unknown record (e.g., by majority vote)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
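A minimal sketch of this procedure (assumed helper names, Euclidean distance, and a simple majority vote -- not the only possible choices):

    from collections import Counter
    import math

    def euclidean(a, b):
        # Distance between two records represented as numeric tuples.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(train, x, k=3):
        # train is a list of (record, label) pairs; x is the unknown record.
        # 1. Compute the distance from x to every training record.
        by_distance = sorted(train, key=lambda pair: euclidean(pair[0], x))
        # 2. Keep the k nearest neighbors.
        k_nearest = by_distance[:k]
        # 3. Majority vote over the neighbors' class labels.
        votes = Counter(label for _, label in k_nearest)
        return votes.most_common(1)[0][0]

    # Toy usage
    train = [((1.0, 1.0), 'duck'), ((1.2, 0.9), 'duck'), ((5.0, 5.0), 'goose')]
    print(knn_classify(train, (1.1, 1.0), k=3))   # -> 'duck'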
1-nearest-neighbor: the decision regions form a Voronoi diagram of the training points
Compute the distance between two points.
Any of the similarity/distance measures discussed earlier can be used here.
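For example, Manhattan distance or a cosine-based distance could be plugged in instead of Euclidean distance in the sketch above (assumed helper names):

    import math

    def manhattan(a, b):
        # L1 (city-block) distance between two numeric records.
        return sum(abs(x - y) for x, y in zip(a, b))

    def cosine_distance(a, b):
        # 1 - cosine similarity, so that smaller still means "closer".
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)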
Determine the class from the nearest-neighbor list, e.g., by taking the majority vote of the class labels among the k nearest neighbors
Balancing the choice of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes
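One common refinement (an assumption here, not stated in the notes): weight each neighbor's vote by the inverse square of its distance, so closer neighbors count more and the result is less sensitive to the exact choice of k:

    from collections import defaultdict

    def weighted_vote(neighbors):
        # neighbors: list of (distance, label) pairs for the k nearest records.
        scores = defaultdict(float)
        for d, label in neighbors:
            # Closer neighbors get larger weights; the epsilon avoids division
            # by zero when the unknown record coincides with a training record.
            scores[label] += 1.0 / (d ** 2 + 1e-12)
        return max(scores, key=scores.get)

    # Two very close 'duck' neighbors outweigh three more distant 'goose' ones.
    print(weighted_vote([(0.1, 'duck'), (0.2, 'duck'),
                         (1.0, 'goose'), (1.1, 'goose'), (1.2, 'goose')]))  # -> 'duck'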
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- height of a person may vary from 1.5m to 1.8m
- weight of a person may vary from 90lb to 300lb
- income of a person may vary from $10K to $1M
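A minimal min-max scaling sketch (one common normalization; the toy values are chosen to match the ranges above):

    def min_max_scale(records):
        # Rescale each attribute to [0, 1] so no single attribute dominates the distance.
        cols = list(zip(*records))
        lows = [min(col) for col in cols]
        highs = [max(col) for col in cols]
        return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                      for v, lo, hi in zip(rec, lows, highs))
                for rec in records]

    # (height in m, weight in lb, income in $) -- unscaled, income dominates the distance.
    people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.7, 160, 55_000)]
    print(min_max_scale(people))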
Selection of the best similarity measure is critical
k-NN classifiers are lazy learners since they do not build models explicitly, a priori
Classifying unknown records is relatively expensive--each one requires searching the training set for its k nearest neighbors
Can result in arbitrarily shaped decision boundaries (adv or disadv?)
Easy to handle variable interactions since the decisions are based on local information
Selection of the right proximity measure is essential
Superfluous or redundant attributes can create problems
Missing attributes are hard to handle
Avoid having to compute distance to all objects in the training set. Possible techniques include
Condensing: Determine a smaller set of objects that give the same performance
Editing: Remove objects to improve efficiency
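One classic condensing method is Hart's condensed nearest-neighbor rule; the sketch below (helper names and the reuse of Euclidean distance are assumptions) keeps only the training records needed for 1-NN to classify the full training set correctly:

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def condense(train):
        # train: list of (record, label) pairs.
        # Returns a (usually much smaller) subset to use as the stored training set.
        keep = [train[0]]                 # seed with an arbitrary record
        changed = True
        while changed:
            changed = False
            for rec, label in train:
                # Classify rec by 1-NN over the records kept so far.
                nearest = min(keep, key=lambda pair: euclidean(pair[0], rec))
                if nearest[1] != label:   # misclassified -> this record must be kept
                    keep.append((rec, label))
                    changed = True
        return keep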
From Wikipedia: A 3-dimensional k-d tree. The first split (the red vertical plane) cuts the root cell (white) into two subcells, each of which is then split (by the green horizontal planes) into two subcells. Finally, those four cells are split (by the four blue vertical planes) into two subcells. Since there is no more splitting, the final eight are called leaf cells.
A k-d tree is a generalized BST. The training data is preprocessed into the tree.
The k-d tree is then searched to find the nearest neighbors of the observation to be classified
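A short sketch of that lookup using scipy.spatial.cKDTree (an assumed library choice; the toy data and labels are illustrative):

    import numpy as np
    from collections import Counter
    from scipy.spatial import cKDTree

    X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
    y_train = ['duck', 'duck', 'goose', 'goose']

    tree = cKDTree(X_train)              # preprocess the training data once

    def classify(x, k=3):
        # Search the k-d tree for the k nearest neighbors of the observation x.
        distances, indices = tree.query(x, k=k)
        votes = Counter(y_train[i] for i in indices)
        return votes.most_common(1)[0][0]

    print(classify([1.1, 1.0]))          # -> 'duck'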