Nearest Neighbor Classifiers


last updated 23-Jun-2021

Introduction

Instance-based classifiers delay model processing/creation until a test observation is ready to be classified; the stored training records themselves serve as the "model."

Examples

If it walks like a duck, quacks like a duck, then it’s probably a duck


Nearest-Neighbor Classifiers

Requires three things

  1. Training set: the set of labeled records
  2. A Distance Metric: to compute distance between records
  3. Some value of k, the number of nearest neighbors to retrieve / consider

To classify an unknown record:

  1. Compute its distance to every record in the training set
  2. Identify the k nearest neighbors, i.e., the k training records with the smallest distances
  3. Use the class labels of those nearest neighbors to determine the class of the unknown record (e.g., by majority vote)

The k nearest neighbors of a record x are the data points that have the k smallest distances to x.
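As a rough sketch of these steps (not from the original notes; the data and function names are invented for illustration), a plain Python implementation using Euclidean distance and a majority vote could look like this:

    import math
    from collections import Counter

    def euclidean(p, q):
        # straight-line distance between two numeric vectors
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def knn_classify(train_X, train_y, x, k=3):
        # 1. compute the distance from x to every training record
        distances = [(euclidean(row, x), label) for row, label in zip(train_X, train_y)]
        # 2. keep the k records with the smallest distances
        neighbors = sorted(distances)[:k]
        # 3. majority vote over the neighbors' class labels
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # tiny made-up data set: two numeric attributes, two classes
    X = [(1.0, 1.1), (1.2, 0.9), (5.0, 5.2), (4.8, 5.1)]
    y = ["duck", "duck", "goose", "goose"]
    print(knn_classify(X, y, (1.1, 1.0), k=3))   # -> duck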

1-nearest-neighbor classification can be visualized with a Voronoi diagram: each training point defines the region of space that is closer to it than to any other training point.


Nearest Neighbor Classification

Compute the distance between the record to be classified and each training point. Any of the distance or similarity measures from the similarity unit can be used here (Euclidean, Manhattan, cosine, ...).
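For instance, SciPy provides several of these measures directly; a small comparison on a made-up pair of points (an illustration, not part of the original notes):

    from scipy.spatial import distance

    u = (1.0, 2.0, 3.0)
    v = (2.0, 0.0, 3.5)

    print(distance.euclidean(u, v))       # straight-line (L2) distance
    print(distance.cityblock(u, v))       # Manhattan (L1) distance
    print(distance.cosine(u, v))          # 1 - cosine similarity
    print(distance.minkowski(u, v, p=3))  # Minkowski (Lr) distance with r = 3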

Considerations

Determining the class from the list of nearest neighbors: take a majority vote of the class labels of the k nearest neighbors, optionally weighting each vote by the inverse of the distance (e.g., weight = 1/d^2) so that closer neighbors count more.

Balancing the choice of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes. A common approach is to pick k by cross-validation, as in the sketch below.
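A minimal sketch of choosing k by cross-validation with scikit-learn (the data set and the candidate values of k are assumptions for illustration, not part of the original notes):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # estimate accuracy for several candidate values of k and keep the best
    scores = {}
    for k in (1, 3, 5, 7, 9, 11):
        clf = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(clf, X, y, cv=5).mean()

    best_k = max(scores, key=scores.get)
    print(scores, "best k:", best_k)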

Scaling issues

Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.

Example: the height of a person may vary from 1.5 m to 1.8 m, weight from 90 lb to 300 lb, and income from $10K to $1M; without scaling, the income attribute dominates the distance computation.
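One simple remedy is min-max normalization of each attribute before computing distances; a sketch with hypothetical height/weight/income values:

    import numpy as np

    # rows are records; columns are height (m), weight (lb), income ($) -- hypothetical values
    X = np.array([
        [1.60, 120,  25_000],
        [1.75, 180, 900_000],
        [1.80, 250,  40_000],
    ])

    # min-max normalization: rescale every attribute to the range [0, 1]
    X_min, X_max = X.min(axis=0), X.max(axis=0)
    X_scaled = (X - X_min) / (X_max - X_min)

    print(X_scaled)   # all attributes now contribute comparably to the distance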

Selection of the best similarity measure is critical

Other issues with k-NN Classification

k-NN classifiers are lazy learners since they do not build models explicitly, a priori

Classifying unknown records is relatively expensive: each classification requires searching the training set for the k nearest neighbors

Can result in arbitrarily shaped decision boundaries (is that an advantage or a disadvantage?)

Easy to handle variable interactions since the decisions are based on local information

Selection of the right proximity measure is essential

Superfluous or redundant attributes can create problems

Missing attributes are hard to handle

 


Improving k-NN Efficiency

Avoid having to compute distance to all objects in the training set. Possible techniques include

Condensing: Determine a smaller set of objects that give the same performance

Editing: Remove objects to improve efficiency

Proximity graphs: precompute a graph that links each training object to its nearby objects, and use it to restrict which distances need to be examined

k-d Tree

https://upload.wikimedia.org/wikipedia/commons/b/b6/3dtree.png

From Wikipedia: A 3-dimensional k-d tree. The first split (the red vertical plane) cuts the root cell (white) into two subcells, each of which is then split (by the green horizontal planes) into two subcells. Finally, those four cells are split (by the four blue vertical planes) into two subcells. Since there is no more splitting, the final eight are called leaf cells.

A k-d tree is a generalized binary search tree for k-dimensional data. The training data are preprocessed into the tree once, ahead of time.

The observation to be classified is then used as a query point, and the k-d tree is searched to find its nearest neighbors without computing the distance to every training record.
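A sketch of this idea using SciPy's KDTree (the data points are made up for illustration):

    import numpy as np
    from scipy.spatial import KDTree

    # preprocess the training data into a k-d tree (done once)
    train_X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]])
    train_y = np.array(["duck", "duck", "goose", "goose"])
    tree = KDTree(train_X)

    # query the tree for the k nearest neighbors of a new observation
    dist, idx = tree.query([1.1, 1.0], k=3)
    print(train_y[idx])   # labels of the 3 nearest neighbors, e.g. for a majority vote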