# Decision Trees

last updated 11/10/10

## Algorithm to build Decision Trees

This is known as the C4.5 algorithm.

Let T be the set of training instances.

1. Choose an attribute that best differentiates the instances in T; "best" will be defined below.

2. Create a tree node for this best attribute. Create child links from the node, where each link represents a unique value (or range) of the attribute.

3. Use the child link values to further subdivide the instances into subclasses.
   For each subclass:
   1. If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path of the tree is empty, specify the classification for new instances following this decision path.
   2. If the subclass does not satisfy the predefined criteria and there is at least one attribute left to further subdivide the path of the tree, set T to the current set of subclass instances and repeat the algorithm.

Note that the attribute choices made when building a decision tree determine the size of the constructed tree. We want to minimize the number of tree levels and tree nodes to maximize data generalization.
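The recursive procedure above can be sketched in Python. The instance format (dicts mapping attribute names to values, plus a "class" key) and the `best_attribute` stub are illustrative assumptions, not part of C4.5 itself; the real algorithm selects the attribute with the largest information gain, as described in the sections that follow.

```python
def majority_class(instances):
    """Most common class label among the instances."""
    labels = [inst["class"] for inst in instances]
    return max(set(labels), key=labels.count)

def best_attribute(instances, attributes):
    # Placeholder: a real implementation returns the attribute with the
    # largest information gain over these instances.
    return attributes[0]

def build_tree(instances, attributes):
    labels = {inst["class"] for inst in instances}
    # Stop when all instances agree, or no attributes remain on this path.
    if len(labels) == 1 or not attributes:
        return majority_class(instances)
    attr = best_attribute(instances, attributes)
    node = {"attribute": attr, "children": {}}
    # One child link per unique value of the chosen attribute;
    # each child is built from the matching subclass of instances.
    for value in {inst[attr] for inst in instances}:
        subset = [inst for inst in instances if inst[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node["children"][value] = build_tree(subset, remaining)
    return node
```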

The idea behind choosing the best attribute is to select the attribute whose split of the data shows the largest information gain.

## Example: weather decision tree

Which attributes should we select? (The original notes include a graphical breakdown of the attributes here.)

## Information Theory

Information is measured in bits. Roughly, the number of bits represents the variability of some information/data. For example, to represent 4 values (or 4 different classifications) we need 2 bits; 5 to 8 classifications need 3 bits. The bit measure need not be an integer in the calculations below.
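As a quick sanity check of those whole-bit counts, using only Python's standard `math` module:

```python
import math

def bits_needed(n_classes):
    """Whole bits needed to distinguish n_classes distinct values."""
    return math.ceil(math.log2(n_classes))

# 4 classifications fit in 2 bits; 5 through 8 need 3 bits.
```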

For a given probability distribution of an event (the distribution split of results across an attribute), the information required to predict the event is called the distribution's entropy.

Entropy gives the information required in bits and will involve fractions of bits for measurement purposes.

Entropy is calculated as the sum of the probabilities of the events (pi) times the information required for each event in bits (bi):

entropy(p1, p2, ..., pn) = p1*b1 + p2*b2 + ... + pn*bn

Note that bi = -log2(pi), so:

entropy(p1, p2, ..., pn) = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn)
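The entropy formula translates directly to Python; skipping zero probabilities implements the usual convention 0*log2(0) = 0:

```python
import math

def entropy(*probs):
    """entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn),
    taking 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```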

Entropy for "Outlook" = Sunny:

• info([2,3]) = entropy(2/5, 3/5)
= -0.4*log2(0.4) - 0.6*log2(0.6) = 0.971 bits

Entropy for "Outlook" = Overcast:

• info([4,0]) = entropy(1, 0)
= -1*log2(1) - 0*log2(0) = 0 bits
(log2(0) is undefined, but we use the convention 0*log2(0) = 0)

Entropy for "Outlook" = Rainy:

• info([3,2]) = entropy(3/5, 2/5)
= -0.6*log2(0.6) - 0.4*log2(0.4) = 0.971 bits

Entropy for "Outlook" over all its values:

• info([2,3], [4,0], [3,2])
= (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits
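The weighted combination above can be sketched as follows, taking each split as a list of class counts (so info([2,3], [4,0], [3,2]) weights each subset's entropy by its share of the instances):

```python
import math

def entropy_from_counts(counts):
    """Entropy of a class distribution given as counts, e.g. [2, 3]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info(*splits):
    """Entropy of each split, weighted by the split's share of all instances."""
    total = sum(sum(s) for s in splits)
    return sum((sum(s) / total) * entropy_from_counts(s) for s in splits)
```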

## Information Gain

The value (or contribution to information) of an attribute is calculated as
gain(attr) = (information before split) - (information after split)

gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2])
= 0.940-0.693 = 0.247 bits

Doing similar calculations, the gains for the other attributes are:

• gain("Temperature") = 0.029 bits
• gain("Humidity") = 0.152 bits
• gain("Windy") = 0.048 bits
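These gains can be reproduced with a short script. Only Outlook's counts appear explicitly in these notes; the per-value class counts for the other attributes below are assumed from the standard 14-instance weather (play-tennis) dataset these notes appear to use.

```python
import math

def entropy_from_counts(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(before, splits):
    """Information gain: entropy before the split minus the weighted
    entropy of the subsets after the split."""
    total = sum(sum(s) for s in splits)
    after = sum((sum(s) / total) * entropy_from_counts(s) for s in splits)
    return entropy_from_counts(before) - after

# Class counts [yes, no] per attribute value (assumed standard weather data).
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],  # Sunny, Overcast, Rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
    "Humidity":    [[3, 4], [6, 1]],          # High, Normal
    "Windy":       [[6, 2], [3, 3]],          # False, True
}
gains = {attr: gain([9, 5], s) for attr, s in splits.items()}
```

The attribute with the largest gain, here "Outlook", becomes the root split.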

So choose "Outlook" as the "best" attribute. Now recompute the gains for the remaining three attributes and repeat the process within each "Outlook" category.

For the Sunny case, "Humidity" is chosen as the best attribute; for the Rainy case, "Windy" is chosen. (The gain calculations are not shown.)

## Decision Tree Rules

Rules are a simpler form of the decision tree and can easily be expressed as text.

IF Age<=43 && Sex=Male
&& CreditCardInsurance = No
THEN LifeInsurancePromo = No

• here the antecedent conditions cover 4 of the 15 training-set instances, with 75% accuracy (3 of the 4 covered instances have LifeInsurancePromo = No)

IF Sex=Male
&& CreditCardInsurance = No
THEN LifeInsurancePromo = No

• removing one antecedent condition, the rule covers 6 instances, 5 of them correctly, for 83% accuracy

Tools will typically create rules and then simplify them.
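The coverage and accuracy figures above can be checked mechanically. The instance format and dataset below are illustrative assumptions, not the (unshown) 15-instance credit-card training set:

```python
def rule_stats(instances, antecedent, consequent):
    """Return (coverage, accuracy) of an IF antecedent THEN consequent rule.
    Both antecedent and consequent are dicts of attribute -> required value."""
    covered = [i for i in instances
               if all(i.get(k) == v for k, v in antecedent.items())]
    correct = [i for i in covered
               if all(i.get(k) == v for k, v in consequent.items())]
    accuracy = len(correct) / len(covered) if covered else 0.0
    return len(covered), accuracy
```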

## Other Considerations

Regression trees take the form of decision trees, but their leaf nodes hold numerical rather than categorical outcomes.

Advantages:

• easy to understand
• map nicely to a set of production rules
• make no prior assumptions about the nature of the data
• can build models with datasets containing categorical and numerical data (convert to categories?)

Issues:

• outputs must be categorical
• no multiple outputs (unless the categories of the outputs can be paired up)
• algorithms are unstable, as minor variations in the training data can cause different results (a change in an internal node will propagate down the tree)
• trees using numerical datasets may be complex, as the attribute splits are binary (less than or greater than a value)
  - categorical attributes, by contrast, may have multiple children, potentially flattening the tree