Decision Trees

last updated 11/10/10

Algorithm to build Decision Trees

This is known as the C4.5 algorithm

Let T be the set of training instances

  1. Choose an attribute that best differentiates the instances in T; "best" will be defined below.

  2. Create a tree node for this best attribute. Create child links from the node, where each link represents a unique value (or range) of the attribute.

  3. Use the child link values to further subdivide the instances into subclasses.
    For each subclass
    1. If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path of the tree is null, specify the classification for new instances following this decision path.
    2. If the instances in the subclass do not satisfy the predefined criteria and there is at least one attribute available to further subdivide this path of the tree, set T to the current set of subclass instances and repeat the algorithm.
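The recursion above can be sketched in Python. This is a minimal sketch, not C4.5 itself (no pruning, no numeric attributes): rows are dicts, `target` names the class attribute, and the helper names (`entropy`, `gain`, `build_tree`) are our own.

```python
import math

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(v) / total) * math.log2(labels.count(v) / total)
                for v in set(labels))

def gain(rows, attr, target):
    """Information gain of splitting rows on attr (defined in detail below)."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:       # step 3.1: stop here
        return max(set(labels), key=labels.count)     # leaf: majority class
    best = max(attributes, key=lambda a: gain(rows, a, target))  # step 1
    children = {}                                     # step 2: one link per value
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]            # step 3
        rest = [a for a in attributes if a != best]
        children[value] = build_tree(subset, rest, target)        # step 3.2
    return (best, children)
```

A pure split stops the recursion immediately, which is the simplest form of the "predefined criteria" in step 3.1.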

Note that the attribute choices made when building a decision tree determine the size of the constructed tree. We want to minimize the number of tree levels and tree nodes to maximize data generalization.

The idea behind choosing the best attribute is to select the one whose split of the data yields the largest information gain.

Example: weather decision tree

[Figure: weather data and decision tree, to play or not depending on the weather]

Which attributes to select? Here's a graphical breakdown of the attributes.

[Figure: per-attribute breakdown of the weather data]

Information Theory

Information is measured in bits. Roughly, the number of bits measures the variability of some information or data. For example, to represent 4 values (or 4 different classifications) we need 2 bits; 5 to 8 classifications need 3 bits. The bit measure need not be an integer in the calculations below.
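The whole-bit counts above follow from ceil(log2(n)) for n classes; a quick check (the function name is ours):

```python
import math

def bits_needed(n_classes):
    # Whole bits needed to distinguish n distinct classifications.
    return math.ceil(math.log2(n_classes))

bits_needed(4)   # 2 bits for 4 classifications
bits_needed(5)   # 3 bits
bits_needed(8)   # 3 bits
```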

For a given probability distribution of an event (the distribution of results across an attribute's values), the information required to predict the event is called the distribution's entropy.

Entropy gives the information required in bits and will involve fractions of bits for measurement purposes.

Entropy is calculated as the sum, over the possible outcomes, of each outcome's probability (pi) times the information required for that outcome in bits (bi):

entropy(p1, p2, ..., pn) = p1*b1 + p2*b2 + ... + pn*bn

Note that bi = -log2 pi, which gives:
entropy(p1, p2, ..., pn) = -p1 log2 p1 - p2 log2 p2 - ... - pn log2 pn

Entropy for "Outlook" = Sunny:

Entropy for "Outlook" = Overcast:

Entropy for "Outlook" = Rainy:

Entropy for "Outlook" over all its values:

Information Gain

The value (or contribution to information) of an attribute is calculated as
gain(attr) = (information before split) - (information after split)

gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2])
               = 0.940 - 0.693 = 0.247 bits
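The same numbers as a small Python check (the function names are ours, not a standard API):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(before, splits):
    """Gain = info before the split minus size-weighted info after it."""
    total = sum(before)
    after = sum(sum(s) / total * entropy(s) for s in splits)
    return entropy(before) - after

info_gain([9, 5], [[2, 3], [4, 0], [3, 2]])   # ~0.247 bits
```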

The gains for the other attributes, computed the same way:

[Figure: gain values for the remaining attributes]
So choose "Outlook" as the "best" attribute. Now recompute the gains for the remaining three attributes and repeat the process within each "Outlook" category.

So for the Sunny case:

[Figure: gains for the remaining attributes within the Sunny subset]

So choose "Humidity" as the best attribute for the sunny case.
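The sunny-subset gains can be checked the same way. The [yes, no] counts below come from the standard 14-instance weather dataset, not from the notes above, so treat them as an assumption:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(before, splits):
    total = sum(before)
    return entropy(before) - sum(sum(s) / total * entropy(s) for s in splits)

# [yes, no] counts within the 5 sunny instances, per attribute value
# (assumed from the standard weather dataset):
sunny = [2, 3]
info_gain(sunny, [[2, 0], [0, 3]])           # Humidity (normal, high):     ~0.971
info_gain(sunny, [[0, 2], [1, 1], [1, 0]])   # Temperature (hot, mild, cool): ~0.571
info_gain(sunny, [[1, 1], [1, 2]])           # Windy (true, false):          ~0.020
```

Humidity splits the sunny subset into two pure groups, so its gain equals the whole subset entropy, the best possible result.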

[Figure: final weather decision tree]

And "Windy" for the rainy case (gains not shown).

Decision Tree Rules

[Figure: credit card promotional database]

Rules are a simpler form of the decision tree and can easily be expressed as text.

IF Age<=43 && Sex=Male
&& CreditCardInsurance = No
THEN LifeInsurancePromo = No

IF Sex=Male
&& CreditCardInsurance = No
THEN LifeInsurancePromo = No

Tools will typically create rules and then simplify them.
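The two rules above translate directly into code. A sketch, with field names taken from the rules (the function names are ours):

```python
def rule_full(row):
    # IF Age <= 43 AND Sex = Male AND CreditCardInsurance = No
    # THEN LifeInsurancePromo = No
    return (row["Age"] <= 43 and row["Sex"] == "Male"
            and row["CreditCardInsurance"] == "No")

def rule_simplified(row):
    # Same conclusion with the Age test dropped, as in the second rule:
    # IF Sex = Male AND CreditCardInsurance = No THEN LifeInsurancePromo = No
    return row["Sex"] == "Male" and row["CreditCardInsurance"] == "No"
```

Any row matched by the full rule is also matched by the simplified one, which is why the simplification is safe when the extra test does not change the outcome.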


Other Considerations

Regression trees take the form of decision trees. Here the leaf nodes are numerical rather than categorical outcomes.
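A one-line illustration of the difference, assuming the common choice of predicting the mean of the training targets that reach a leaf (the notes above do not prescribe a particular choice):

```python
def regression_leaf(targets):
    # Regression-tree leaf: predict a number (the mean of the training
    # targets reaching this leaf) instead of a majority class label.
    return sum(targets) / len(targets)

regression_leaf([3.0, 5.0, 4.0])   # 4.0
```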

Advantages of decision trees: