Classification

DM 352 Syllabus | DM 552 Syllabus

last updated 23-Jun-2021

Classification Definition

Given a collection of records, build a model that can be used to predict a designated class variable.

For the collection, or a subset of it called the training set:

Each record of the collection (training set) is characterized by a tuple (x, y), where x is the attribute set and y is the class label.

Other paired terms used for x and y: for x, attribute set, features, predictors, or independent variables; for y, class label, category, target, or dependent variable.

Task of classification:

Learn/build a model that maps each attribute set x into one of the predefined class labels y

Examples

Task                        | Attribute set, x                                          | Class label, y
Categorizing email messages | Features extracted from email message header and content  | spam or non-spam
Identifying tumor cells     | Features extracted from MRI scans                         | malignant or benign cells
Cataloging galaxies         | Features extracted from telescope images                  | elliptical, spiral, or irregular-shaped galaxies

General process for classification model building

An induction process builds (learns) the model from the training set.

A deduction process applies the model to test records to assess its accuracy/quality.
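As a quick end-to-end illustration of the induce/deduce loop, here is a minimal Python sketch. It assumes scikit-learn is available; the data set is synthetic and the parameter choices are arbitrary.

  # Induction/deduction sketch using scikit-learn (assumed installed).
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import accuracy_score

  # Synthetic stand-in for a collection of records (x, y)
  X, y = make_classification(n_samples=200, n_features=4, random_state=0)

  # Hold out part of the collection as a test set
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

  # Induction: learn a model from the training set
  model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

  # Deduction: apply the model to the test set and measure its accuracy
  print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))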


Classification Techniques

Base Classifiers

A single model is built from the training set and used to predict the class label of new records.

Ensemble Classifiers

Use a combination of different classifiers or variations of the same classifier


Decision Tree example


Apply the model to test data



Decision Tree Classification Task

 

Many algorithms have been developed, including Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, and SPRINT.


Hunt's Algorithm

Let Dt be the set of training records that reach a node t

General Procedure:

  If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.

  If Dt contains records that belong to more than one class, use an attribute test condition to split the records into smaller subsets, and recursively apply the procedure to each subset.

Choosing which attribute to use for splitting is considered below.

This example magically chooses a good one.
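In code, Hunt's general procedure is a short recursion. The sketch below uses a placeholder choose_attribute that simply takes the first remaining attribute, standing in for the split-selection measures discussed below.

  # Sketch of Hunt's procedure. Each record is a dict of attribute values plus a "label" key.
  from collections import Counter

  def choose_attribute(records, attributes):
      return attributes[0]        # placeholder: "magically" pick one (see the measures below)

  def hunt(records, attributes):
      labels = [r["label"] for r in records]
      if len(set(labels)) == 1 or not attributes:
          # Dt is pure (or there is nothing left to split on): t becomes a leaf
          return Counter(labels).most_common(1)[0][0]
      # Otherwise split Dt with an attribute test and recurse on each subset
      attr = choose_attribute(records, attributes)
      subtrees = {}
      for value in set(r[attr] for r in records):
          subset = [r for r in records if r[attr] == value]
          subtrees[value] = hunt(subset, [a for a in attributes if a != attr])
      return (attr, subtrees)     # internal node: tested attribute plus child subtrees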

 


Design Issues of Decision Tree Induction

How should training records be split?

How should the splitting procedure stop?

Methods for Expressing Test Conditions

Binary

A binary attribute produces a simple two-way split.

Nominal

Multi-way split: use as many partitions as there are distinct values.

Binary split: divide the values into two groups; several different groupings are possible.

Ordinal

Multi-way split: use as many partitions as there are distinct values.

Binary split: divide the values into two subsets while preserving the order of the values (the grouping shown on the right of the figure, which mixes non-adjacent values, is a bad choice).

 

Continuous

A splitting value has to be chosen.

There are different ways of handling discretization to form an ordinal categorical attribute.

Binary decision: (A < v) or (A >= v)
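The different shapes of test condition are easy to picture as partitioning code; a small hypothetical sketch (the attribute names and records are invented):

  # Hypothetical records with a nominal attribute and a continuous attribute
  records = [
      {"marital": "single",   "income": 60.0},
      {"marital": "married",  "income": 95.0},
      {"marital": "divorced", "income": 42.0},
  ]

  # Multi-way split on a nominal attribute: one partition per distinct value
  multiway = {}
  for r in records:
      multiway.setdefault(r["marital"], []).append(r)

  # Binary split on a continuous attribute: pick a threshold v, test (A < v) vs (A >= v)
  v = 50.0
  binary = {"A < v":  [r for r in records if r["income"] < v],
            "A >= v": [r for r in records if r["income"] >= v]}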


Determining the best split

That is, what attribute provides the best split?  And what is "best"?

So which would be the best split?

Greedy Approach

Nodes with a purer class distribution are preferred, so we need a measure of node impurity.

Example measures of impurity

Gini index:
Gini(C0=5, C1=5) = 1 - (0.5)^2 - (0.5)^2 = 0.5
Gini(C0=9, C1=1) = 1 - (0.9)^2 - (0.1)^2 = 0.18

Entropy index:

Entropy(C0=5, C1=5) = -(0.5) log2(0.5) - (0.5) log2(0.5) = -0.5(-1) - 0.5(-1) = 1
Entropy(C0=9, C1=1) = -(0.9) log2(0.9) - (0.1) log2(0.1) = -0.9(-0.152) - 0.1(-3.322) = 0.469

Misclassification error index:
Error(C0=5, C1=5) = 1 - max(0.5, 0.5) = 1 - 0.5 = 0.5
Error(C0=9, C1=1) = 1 - max(0.9, 0.1) = 1 - 0.9 = 0.1

These approaches are discussed further below
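The three measures translate directly into code; a small sketch that reproduces the numbers above (base-2 logarithms for entropy):

  import math

  def gini(counts):
      n = sum(counts)
      return 1.0 - sum((c / n) ** 2 for c in counts)

  def entropy(counts):
      n = sum(counts)
      return -sum(c / n * math.log2(c / n) for c in counts if c > 0)   # 0 log 0 taken as 0

  def error(counts):
      n = sum(counts)
      return 1.0 - max(c / n for c in counts)

  print(gini([5, 5]), gini([9, 1]))          # 0.5 and about 0.18
  print(entropy([5, 5]), entropy([9, 1]))    # 1.0 and about 0.469
  print(error([5, 5]), error([9, 1]))        # 0.5 and about 0.1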

Algorithm for determining best split

  1. Compute impurity measure (P) before splitting
  2. Compute impurity measure (M) after splitting
  3. Choose the attribute test condition that produces the highest gain:
    Gain = P - M

or equivalently, find the lowest impurity measure after splitting (M)

Generically, the process compares the gain of two (or more) candidate attribute test conditions:

Choose the attribute with the highest gain.
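In code, the gain of a candidate split is the parent impurity P minus the weighted impurity M of the children; a sketch using the Gini measure from above (the class counts are made up):

  def gini(counts):
      n = sum(counts)
      return 1.0 - sum((c / n) ** 2 for c in counts)

  def weighted_impurity(children, impurity):
      # M: impurity after the split = weighted average impurity of the child nodes
      n = sum(sum(c) for c in children)
      return sum(sum(c) / n * impurity(c) for c in children)

  def gain(parent, children, impurity):
      # Gain = P - M; the candidate split with the highest gain (lowest M) wins
      return impurity(parent) - weighted_impurity(children, impurity)

  # Compare two hypothetical attribute tests on a parent node with class counts [7, 5]
  print(gain([7, 5], [[4, 1], [3, 4]], gini))
  print(gain([7, 5], [[6, 1], [1, 4]], gini))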


Building the decision tree using GINI as the impurity measure

GINI(t) = 1 - Σj [ p(j | t) ]^2

is the Gini index for a given node t, where p(j | t) is the relative frequency of class j at node t.

For a 2-class problem with class proportions (p, 1 - p): GINI = 1 - p^2 - (1 - p)^2 = 2p(1 - p)

These are the only distinct combinations, since a split of [x,y] is the same as a split of [y,x]:

With a split of [0,6]: P(C1) = 0/6 = 0 and P(C2) = 6/6 = 1, then Gini([0,6]) = 1 - 0^2 - 1^2 = 0.000

With a split of [1,5]: P(C1) = 1/6 and P(C2) = 5/6, then Gini([1,5]) = 1 - (1/6)^2 - (5/6)^2 = 0.278

With a split of [2,4]: P(C1) = 2/6 and P(C2) = 4/6, then Gini([2,4]) = 1 - (2/6)^2 - (4/6)^2 = 0.444

With a split of [3,3]: P(C1) = 1/2 and P(C2) = 1/2, then Gini([3,3]) = 1 - (1/2)^2 - (1/2)^2 = 0.500


Computing the Gini Index for a Collection of Nodes

When a node p is split into k partitions (k children), the quality of the split is measured by the weighted Gini index of the children:

GINIsplit = Σi=1..k (ni / n) GINI(i)

where ni = number of records at child i, and n = number of records at parent node p.

Choose the attribute test condition that minimizes this weighted average Gini index of the children; the smallest weighted Gini index corresponds to the highest gain.

Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT

Example GINI index for binary attributes

Example GINI index for categorical attributes

Which is best?


Example GINI index for continuous attributes

Use binary decisions based on one value

Several choices for the splitting value

Each splitting value has a count matrix associated with it

Simple method to choose the best v: for each candidate v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient because the work is repeated for every candidate value.

More efficient computation, for each attribute:

  Sort the records on the attribute's values.
  Linearly scan the sorted values, each time updating the count matrix and computing the Gini index.
  Choose the split position with the smallest Gini index.
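A sketch of that sorted-scan idea for one continuous attribute, using Gini as the measure; the income values and labels are made up for illustration:

  def gini(counts):
      n = sum(counts)
      return 1.0 - sum((c / n) ** 2 for c in counts)

  def best_continuous_split(values, labels):
      # Sort the records on the attribute's values, then sweep the candidate split
      # positions once, updating the class counts incrementally instead of
      # rescanning the whole data set for every candidate v.
      pairs = sorted(zip(values, labels))
      classes = sorted(set(labels))
      left = {c: 0 for c in classes}
      right = {c: 0 for c in classes}
      for _, y in pairs:
          right[y] += 1
      n = len(pairs)
      best_v, best_g = None, float("inf")
      for i in range(n - 1):
          v, y = pairs[i]
          left[y] += 1
          right[y] -= 1
          if pairs[i + 1][0] == v:
              continue                      # only split between distinct values
          g = ((i + 1) / n) * gini(list(left.values())) + \
              ((n - i - 1) / n) * gini(list(right.values()))
          if g < best_g:
              best_v, best_g = (v + pairs[i + 1][0]) / 2, g
      return best_v, best_g

  # Hypothetical annual-income values (in thousands) with yes/no class labels
  incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
  labels  = ["no", "no", "no", "no", "yes", "no", "no", "yes", "no", "yes"]
  print(best_continuous_split(incomes, labels))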

 


Building the decision tree using Entropy as the impurity measure

 

Entropy(t) = - Σj p(j | t) log2 p(j | t)

is the entropy for a given node t, where p(j | t) is the relative frequency of class j at node t.

The maximum is log2(nc), where nc is the number of classes, reached when records are equally distributed among all classes, implying the least information.

The minimum is 0.0, reached when all records belong to one class, implying the most information.

Entropy-based computations are quite similar to the Gini index computations, but they are more computationally intensive: Gini squares the probability, while entropy multiplies the probability by its logarithm, and computing a logarithm requires much more work.

Example

With a split of [0,6]: P(C1) = 0/6 = 0 and P(C2) = 6/6 = 1, then Entropy([0,6]) = - 0 log2(0) - 1 log2(1) = - 0 - 0 = 0.000

With a split of [1,5]: P(C1) = 1/6 and P(C2) = 5/6, then Entropy([1,5]) = - (1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

With a split of [2,4]: P(C1) = 2/6 and P(C2) = 4/6, then Entropy([2,4]) = - (2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

With a split of [3,3]: P(C1) = 1/2 and P(C2) = 1/2, then Entropy([3,3]) = - (1/2) log2(1/2) - (1/2) log2(1/2) = 1.00


Computing Information Gain with Entropy

When a parent node p with n records is split into k partitions, where ni is the number of records in partition i:

GAINsplit = Entropy(p) - Σi=1..k (ni / n) Entropy(i)

Choose the split that achieves the most reduction in entropy, i.e., that maximizes GAIN.

Entropy is used in the ID3 and C4.5 decision tree algorithms.

 


Problem with large number of partitions

Node impurity measures tend to prefer splits that result in a large number of partitions, each small but pure.

For example, Customer ID has the highest information gain because the entropy of each child node is zero, but we know it is not a useful predictor attribute.


Gain Ratio

The gain ratio helps counteract this problem.

When a parent node p with n records is split into k partitions, where ni is the number of records in partition i:

GainRATIOsplit = GAINsplit / SplitINFO,  where  SplitINFO = - Σi=1..k (ni / n) log2(ni / n)

The ratio adjusts the information gain by the entropy of the partitioning (SplitINFO).

Higher-entropy partitionings (a large number of small partitions) are penalized!

Used in the C4.5 algorithm; designed to overcome the disadvantage of plain information gain.
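SplitINFO and the gain ratio follow directly from the formulas above; a small sketch (class counts made up) showing how a Customer-ID-like split is penalized:

  import math

  def entropy(counts):
      n = sum(counts)
      return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

  def gain_ratio(parent, children):
      n = sum(parent)
      weights = [sum(c) / n for c in children]
      gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
      split_info = -sum(w * math.log2(w) for w in weights if w > 0)
      return gain / split_info if split_info else 0.0

  # An ID-like split into 10 tiny pure partitions vs. a sensible two-way split,
  # both on a parent node with class counts [6, 4] (made-up numbers).
  id_like = [[1, 0]] * 6 + [[0, 1]] * 4
  two_way = [[5, 0], [1, 4]]
  print(gain_ratio([6, 4], id_like))   # high raw gain, but heavily penalized by SplitINFO
  print(gain_ratio([6, 4], two_way))   # lower raw gain, yet a higher gain ratio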

 


Building the decision tree using Classification Error as the impurity measure

Error(t) = 1 - maxj [ p(j | t) ]

is the classification error at a node t.

The maximum of Error is (1 - 1/nc), reached when records are equally distributed among all classes, implying the least information.

The minimum of Error is 0.0, reached when all records belong to one class, implying the most information.

With a split of [0,6]: P(C1) = 0/6 = 0 and P(C2) = 6/6 = 1, then Error([0,6]) = 1 - max(0, 1) = 1 - 1 = 0.000

With a split of [1,5]: P(C1) = 1/6 and P(C2) = 5/6, then Error([1,5]) = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

With a split of [2,4]: P(C1) = 2/6 and P(C2) = 4/6, then Error([2,4]) = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

With a split of [3,3]: P(C1) = 1/2 and P(C2) = 1/2, then Error([3,3]) = 1 - max(1/2, 1/2) = 1 - 1/2 = 1/2

Comparison among Impurity Measures

for a 2-class problem, plotted as a function of p, the fraction of records in one class: all three measures are 0 at p = 0 and p = 1 and peak at p = 0.5.


Misclassification Error vs Gini Index

 


Algorithm to build Decision Trees

This is known as the C4.5 algorithm

Let T be the set of training instances

  1. Choose the attribute that best differentiates the instances in T; "best" is determined by a splitting measure such as the information gain or gain ratio described above, and is illustrated in the weather example below.

  2. Create a tree node for this best attribute, and create child links from the node where each link represents a unique value (or range of values) of the attribute.

  3. Use the child link values to further subdivide the instances into subclasses.
    For each subclass
    1. If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path of the tree is empty, specify the classification for new instances following this decision path.
    2. If the subclass does not satisfy the predefined criteria and there is at least one attribute to further subdivide the path of the tree, reset T to the current set of subclass instances and repeat the algorithm.
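Putting the pieces together, here is a compact sketch of this recursive build loop for categorical attributes, using information gain as the "best" measure. It is not the full C4.5 (no gain ratio, pruning, or continuous splits).

  import math
  from collections import Counter

  def entropy(labels):
      n = len(labels)
      return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

  def info_gain(records, attr):
      # Gain = entropy before the split - weighted entropy of the subclasses
      labels = [r["label"] for r in records]
      groups = {}
      for r in records:
          groups.setdefault(r[attr], []).append(r["label"])
      after = sum(len(g) / len(records) * entropy(g) for g in groups.values())
      return entropy(labels) - after

  def build_tree(records, attributes):
      labels = [r["label"] for r in records]
      if len(set(labels)) == 1 or not attributes:
          return Counter(labels).most_common(1)[0][0]      # leaf: majority class
      best = max(attributes, key=lambda a: info_gain(records, a))
      node = {"attr": best, "children": {}}
      for value in set(r[best] for r in records):
          subset = [r for r in records if r[best] == value]
          node["children"][value] = build_tree(subset, [a for a in attributes if a != best])
      return node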

Advantages:

  Relatively inexpensive to construct and very fast at classifying new records.
  Easy to interpret, especially for smaller trees.
  Can handle both categorical and continuous attributes.

Disadvantages:

  The greedy, one-attribute-at-a-time splitting does not consider interactions between attributes and may miss a better overall tree.
  Each test involves a single attribute, so decision boundaries are axis-parallel.
  Trees can overfit the training data unless growth is limited or the tree is pruned.


Example: weather decision tree

To play or not to play, depending on the weather. (Figure: weather decision tree.)

Which attributes to select? Here's a graphical breakdown of the attributes.



Entropy calculations for this example

Recall:

entropy(p1, p2, ..., pn) = p1*b1 + p2*b2 + ... + pn*bn

where bi = -log2(pi) is the number of bits needed to encode an outcome of probability pi. Equivalently,
entropy(p1, p2, ..., pn) = -p1 log2 p1 - p2 log2 p2 - ... - pn log2 pn

Entropy for "Outlook" = Sunny: info([2,3]) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971 bits

Entropy for "Outlook" = Overcast: info([4,0]) = -(4/4) log2(4/4) - 0 log2(0) = 0 bits

Entropy for "Outlook" = Rainy: info([3,2]) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971 bits

Entropy for "Outlook" over all its values: info([2,3], [4,0], [3,2]) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693 bits


Information Gain for this example

The value (or contribution to information) of an attribute is calculated as
gain(attr) = (information before split) - (information after split)

gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
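These figures can be checked directly from the class-count breakdowns above; a short sketch:

  import math

  def info(*partitions):
      # info([a, b], [c, d], ...) = weighted average entropy of the class-count lists
      def entropy(counts):
          n = sum(counts)
          return -sum(c / n * math.log2(c / n) for c in counts if c > 0)
      total = sum(sum(p) for p in partitions)
      return sum(sum(p) / total * entropy(p) for p in partitions)

  before = info([9, 5])                    # ~0.940 bits for the whole training set
  after  = info([2, 3], [4, 0], [3, 2])    # ~0.693 bits (sunny, overcast, rainy)
  print(round(before, 3), round(after, 3), round(before - after, 3))   # gain ~ 0.247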

The gains for the other attributes, computed the same way:

gain("Temperature") = 0.029 bits
gain("Humidity") = 0.152 bits
gain("Windy") = 0.048 bits

So choose "Outlook" as the "best" attribute. Now recompute the gains on the remaining three attributes and repeat the process within each of the "Outlook" categories.

So for the Sunny case, the gains on the remaining attributes are:

gain("Humidity") = 0.971 bits
gain("Temperature") = 0.571 bits
gain("Windy") = 0.020 bits

So choose "Humidity" as the best attribute for the sunny case.

(Figure: final weather decision tree.)

Similarly, "Windy" is chosen for the rainy case; those gain calculations are not shown here.