DM 352 Syllabus | DM 552 Syllabus
last updated 23-Jun-2021
Given a collection of records, build a model that can be used to predict a designated class variable.
For the collection or a subset of the collection called the training set:
Each record of the collection (training set) is by characterized by a tuple (x,y), where x is the attribute set and y is the class label.
Other paired terms used for x and y:
- x: attribute, predictor, independent variable, input
- y: class, response, dependent variable, output
Task of classification:
Learn/build a model that maps each attribute set x into one of the predefined class labels y
Attribute set, x
Class label, y
Categorizing email messages
Features extracted from email message header and content
spam or non-spam
Identifying tumor cells
Features extracted from MRI scans
malignant or benign cells
Features extracted from telescope images
Elliptical, spiral, or irregular-shaped galaxies
An induction process builds the model (learn)
A deduction process tests the model for accuracy/quality
Input training set used in classification model to predict an output
- Decision Tree based Methods
- Rule-based Methods
- Neural Networks
- Deep Learning
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
Use a combination of different classifiers or variations of the same classifier
Click through the steps to see the animated
Step 0 Step 1 Step 2 Step 3 Step 4 Step 5
Let Dt be the set of training records that reach a node t
Choosing which attribute to use for splitting is considered below.
This example magically chooses a good one.
How should training records be split?
How should the splitting procedure stop?
Methods for Expressing Test Conditions
Simple splitting binary
Multi-way: use as many partitions as values in nominal/binary cases
Versus binary: can have variations on different combinations of groups, for example
Multi-way: use as many partitions as distinct values
Binary split: Divides values into two subsets
See bad choice on grouping on right
Have to choose a splitting value.
Different ways of handling discretization to form an ordinal categorical attribute
Binary Decision: (A < v) or (A != v )
That is, what attribute provides the best split? And what is "best"?
So which would the be the best split?
Nodes with a purer class distribution are preferred. So we need a meausure of node impurity.
GINI(C0=5,C1=5) = 1 - (0.5)2 - (0.5)2 = 0.5
GINI(C0=9,C1=1) = 1 - (0.9)2- (0.1)2 = 0.18
Entropy(C0=5,C1=5) = -(0.5)log(0.5) + -(0.5)log(0.5) = -0.5*(-1) + -0.5*(-1) = 1
Entropy(C0=9,C1=1) = -(0.9)log(0.9) + -(0.1)log(0.1) = -0.9(-0.152) + -0.1(-3.322) = 0.469
Misclassification error index:
Error(C0=5) = 1 - max(0.5,0.5) = 1 - 0.5 = 0.5
Error(C0=9,C1=1) = 1 - max(0.9,0.1) = 1 - 0.9 = 0.1
These approaches are discussed further below
or equivalently, find the lowest impurity measure after splitting (M)
Generically, the process comparing gain between two (or more) attributes:
Choose attribute with highest Gain.
is the GINI index for a given node t of class j. That is, p( j | t) is the relative frequency of class j at node t.
For 2-class problem (p, 1 – p): GINI = 1 – p2 – (1 – p)2 = 2p (1-p)
These are the only combintions: split [x,y] = split [y,x]
With a split of [0,6]: P(C1) = 0/6 = 0 P(C2) = 6/6 = 1, then Gini([0,6]) = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0.000
With a split of [1,5]: P(C1) = 1/6 and P(C2) = 5/6, then Gini([1,5]) = 1 – (1/6)2 – (5/6)2 = 0.278
With a split of [2,4]: P(C1) = 1/3 and P(C2) = 2/3, then Gini([2,4]) = 1 – (1/3)2 – (2/3)2 = 0.444
With a split of [3,3: P(C1) = 1/2 and P(C2) = 1/2, then Gini([3,3]) = 1 – (1/2)2 – (1/2)2 = 0.500
When a node p is split into k partitions (k children)
where, ni = number of records at child i, and n = number of records at parent node p,
choose the attribute that minimizes weighted average Gini index of the children. The smallest Gini index gives you the highest gain.
Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT
Which is best?
Use binary decisions based on one value
Several choices for the splitting value
Each splitting value has a count matrix associated with it
Simple method to choose best partition v
More efficient computation:
is the entropy for a given node t of class j. That is, p( j | t) is the relative frequency of class j at node t.
Maximum is log nc when records are equally distributed among all classes implying least information
Minimum is 0.0 when all records belong to one class, implying most information
Entropy based computations are quite similar to the GINI index computations, but more computationally intensive. (GINI squares the probability while Entropy multiplies the probability times its log. Calculating a log requires much more work computationally.)
With a split of [0,6]: P(C1) = 0/6 = 0 P(C2) = 6/6 = 1, then Entropy([0,6]) = – 0 log 0 – 1 log 1 = – 0 – 0 = 0.000
With a split of [1,5]: P(C1) = 1/6 and P(C2) = 5/6, then Entropy([1,5]) = – (1/6) log2 (1/6) – (5/6) log2 (1/6) = 0.65
With a split of [2,4]: P(C1) = 2/6 and P(C2) = 4/6, then Entropy([2,4]) = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
With a split of [3,3: P(C1) = 1/2 and P(C2) = 1/2, then Entropy([3,3]) = – (1/2) log2 (1/2) – (1/2) log2 (1/2) = 1.00
Parent Node, p, is split into k partitions where ni is number of records in partition i
Choose the split that achieves most reduction, which maximizes GAIN.
Entropy is used in the ID3 and C4.5 decision tree algorithms.
Node impurity measures tend to prefer splits that result in large number of partitions, each being small but pure.
Customer ID has highest information gain because entropy for all the children is zero. But we know this is not a predictor attribute.
This can help cancel out the above problem.
Parent Node, p, is split into k partitions where niis number of records in partition iThe ratito adjusts Information Gain by the entropy of the partitioning (SplitINFO).
Higher entropy partitioning (large number of small partitions) is penalized!
Used in C4.5 algorithm, designed to overcome the disadvantage of Information Gain
is the Classification error at a node t
Maximum of Error is (1 - 1/nc) when records are equally distributed among all classes implying least information
Minimum of Error is 0.0 when all records belong to one class, implying most information
With a split of [0,6]: P(C1) = 0/6 = 0 P(C2) = 6/6 = 1, then Error([0,6]) = 1 – max (0, 1) = 1 – 1 = 0.000
With a split of [1,5]: P(C1) = 1/6 and P(C2) = 5/6, then Error([1,5]) = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
With a split of [2,4]: P(C1) = 2/6 and P(C2) = 4/6, then Error([2,4]) = Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
With a split of [3,3: P(C1) = 1/2 and P(C2) = 1/2, then Error([3,3]) = Error = 1 – max (1/2, 1/2) = 1 – 1/2 = 1/2
for a 2-class problem:
This is known as the C4.5 algorithm
Let T be the set of training instances
Which attributes to select? Here's a graphical breakdown of the attributes.
entropy(p1,p2, ....,pn) = p1*b1+ p2*b2... + pn*bn
Note that bi = - log2 pi in the cases of a simple (binary) split into two classes.
entropy(p1,p2, ....,pn) = -p1logp1 - p2logp2... - pnlogpn
Entropy for "Outlook" = Sunny:
Entropy for "Outlook" = Overcast:
Entropy for "Outlook" = Rainy:
Entropy for "Outlook" over all its values:
The value (or contribution to information) of an attribute is calculated as
gain(attr) = (information before split) - (information after split)
gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2])
= 0.940-0.693 = 0.247 bits
The gains for the other attributes doing a similar calculation:
So choose "Outlook" as the "best" attribute. Now recompute gain on the remaining three attributes and repeat process within the "Outlook" categories.
So for the Sunny case:
So choose "Humidity" as a best attribute for the sunny case.
And "Windy" for the rainy case. The gains were not shown