last updated 11/10/10

This is known as the C4.5 algorithm

Let T be the set of training instances

- Choose an attribute that *best* differentiates the instances in T; "best" will be defined below.

- Create a tree node for this *best* attribute. Create child links from the node, where each link represents a unique value (or range) of the attribute.

- Use the child link values to further subdivide the instances into subclasses.

For each subclass:

- If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path of the tree is empty, specify the classification for new instances following this decision path.
- If the subclass does not satisfy the predefined criteria and there is at least one attribute left to further subdivide the path of the tree, reset T to the current set of subclass instances and repeat the algorithm (a minimal sketch of this recursion follows this list).
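A minimal Python sketch of this recursive procedure, with "best" chosen by information gain as developed below; the helper and field names (e.g. `inst["class"]`) are illustrative, not part of the original notes:

```python
from collections import Counter
from math import log2

def entropy(instances):
    """Entropy (in bits) of the class distribution of a set of instances."""
    counts = Counter(inst["class"] for inst in instances)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(instances, attr):
    """Information gain from splitting the instances on one attribute."""
    after = 0.0
    for value in {inst[attr] for inst in instances}:
        subset = [inst for inst in instances if inst[attr] == value]
        after += (len(subset) / len(instances)) * entropy(subset)
    return entropy(instances) - after

def build_tree(instances, attributes):
    classes = {inst["class"] for inst in instances}
    if len(classes) == 1:       # subclass is pure: stop and label the leaf
        return {"label": classes.pop()}
    if not attributes:          # no attributes left on this path: majority label
        return {"label": Counter(i["class"] for i in instances).most_common(1)[0][0]}
    best = max(attributes, key=lambda a: gain(instances, a))   # "best" attribute
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in {inst[best] for inst in instances}:           # one child link per value
        subset = [inst for inst in instances if inst[best] == value]
        node["children"][value] = build_tree(subset, remaining)
    return node
```

On the weather data used in the worked example below, the root chosen this way would be "Outlook", matching the gain calculation that follows.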

Note that the attribute choices made when building a decision tree determine the size of the constructed tree. We want to minimize the number of tree levels and tree nodes to maximize data generalization.

The idea behind choosing the *best* attribute is to select the attribute that splits the data so as to give the largest *information gain*.

Which attributes to select? Here's a graphical breakdown of the attributes.

Information is measured in bits. Roughly, the number of bits reflects the variability of some information/data. For example, to represent 4 values (or 4 different classifications) we need 2 bits; 5 to 8 classifications need 3 bits. The bit measure need not be an integer in the calculations below.
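A quick check of those bit counts (a trivial sketch):

```python
from math import ceil, log2

# Whole bits needed to distinguish n distinct values/classifications.
for n in (4, 5, 8):
    print(n, "->", ceil(log2(n)), "bits")   # 4 -> 2, 5 -> 3, 8 -> 3
```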

For a given probability distribution of an event (the distribution of results across an attribute's values), the information required to predict an event is called the distribution's *entropy*.

Entropy gives the information required in bits and will involve fractions of bits for measurement purposes.

Entropy is calculated as the sum of the conditional probabilities of an event (p_i) times the information required for the event in bits (b_i):

entropy(p_1, p_2, ..., p_n) = p_1*b_1 + p_2*b_2 + ... + p_n*b_n

Note that b_i = -log2(p_i), as in the case of a simple (binary) split into two classes. So:

entropy(p_1, p_2, ..., p_n) = -p_1*log2(p_1) - p_2*log2(p_2) - ... - p_n*log2(p_n)
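The same formula as a Python sketch (the function name is illustrative):

```python
from math import log2

def entropy(*probs):
    """entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn), in bits."""
    # Terms with p = 0 are taken to contribute 0, as in the worked examples below.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy(0.5, 0.5))   # 1.0 bit for an even two-way split
```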

Entropy for "Outlook" = Sunny:

- info([2,3]) = entropy(2/5, 3/5)

= -0.4 log2(0.4) - 0.6 log2(0.6) = **0.971 bits**

Entropy for "Outlook" = Overcast:

- info([4,0]) = entropy(1,0)

= -1 log2(1) - 0 log2(0) = **0 bits** (although log2(0) is undefined, we take 0*log2(0) to be 0).

Entropy for "Outlook" = Rainy:

- info([3,2]) = entropy(3/5, 2/5)

= -0.6 log2(0.6) - 0.4 log2(0.4) = **0.971 bits**

Entropy for "Outlook" over all its values:

- info([2,3], [4,0], [3,2])

= (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = **0.693 bits**
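The same weighted calculation as a sketch; `info` here averages the subset entropies, weighted by each subset's share of the 14 instances (helper names are illustrative):

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy in bits of a class-count vector such as [2, 3]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info(*subsets):
    """Weighted average entropy over the subsets produced by a split."""
    total = sum(sum(s) for s in subsets)
    return sum((sum(s) / total) * entropy_from_counts(s) for s in subsets)

print(info([2, 3], [4, 0], [3, 2]))   # ~0.693 bits for "Outlook"
```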

The value (or contribution to information) of an attribute is calculated as

gain(attr) = (information before split) - (information after split)

**gain("Outlook") **= info([9,5]) - info([2,3], [4,0], [3,2])

= 0.940 - 0.693 = **0.247 bits**

The gains for the other attributes, calculated the same way:

**gain("Temperature") = 0.029 bits****gain("Humidity") = 0.152 bits****gain("Windy") = 0.048 bits**

So choose "Outlook" as the "best" attribute. Now recompute gain on the remaining three attributes and repeat process within the "Outlook" categories.

For the Sunny case, "Humidity" is chosen as the best attribute, and for the Rainy case, "Windy". (The gain calculations for these subsets are not shown.)

Rules are a simpler form of the decision tree and can easily be expressed as text.

IF Age<=43 && Sex=Male

&& CreditCardInsurance = No

THEN LifeInsurancePromo = No

- here the antecedent conditions cover 4 of the 15 training set instances with 75% accuracy.

IF Sex=Male

&& CreditCardInsurance = No

THEN LifeInsurancePromo = No

- removing one antecedent condition, the rule covers 6 instances, 5 of them correctly, for 83% accuracy
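A sketch of how a rule's coverage and accuracy could be computed over a training set; the records below are hypothetical toy data, not the 15-instance training set referenced above:

```python
def rule_stats(data, antecedent, consequent):
    """Coverage = instances matching the antecedent;
    accuracy = fraction of those that also match the consequent."""
    covered = [row for row in data if antecedent(row)]
    correct = [row for row in covered if consequent(row)]
    return len(covered), (len(correct) / len(covered) if covered else 0.0)

# Hypothetical training records; field names are illustrative only.
data = [
    {"Age": 35, "Sex": "Male", "CreditCardInsurance": "No", "LifeInsurancePromo": "No"},
    {"Age": 40, "Sex": "Male", "CreditCardInsurance": "No", "LifeInsurancePromo": "Yes"},
    {"Age": 50, "Sex": "Female", "CreditCardInsurance": "Yes", "LifeInsurancePromo": "Yes"},
]

covered, accuracy = rule_stats(
    data,
    antecedent=lambda r: r["Sex"] == "Male" and r["CreditCardInsurance"] == "No",
    consequent=lambda r: r["LifeInsurancePromo"] == "No",
)
print(covered, accuracy)   # 2 covered, 0.5 accuracy on this toy data
```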

Tools will typically create rules and then simplify them.

Regression trees take the same form as decision trees, but the leaf nodes are numerical rather than categorical outcomes.
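A minimal sketch, assuming scikit-learn is available; each leaf of the fitted regression tree predicts a numeric value (the mean of the training targets reaching that leaf):

```python
from sklearn.tree import DecisionTreeRegressor

# Toy numeric data: one feature, numeric target.
X = [[1], [2], [3], [10], [11], [12]]
y = [1.0, 1.2, 0.9, 8.0, 8.5, 7.9]

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[2], [11]]))   # each leaf returns a numeric prediction
```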

Advantages of decision trees:

- easy to understand
- map nicely to a set of production rules
- make no prior assumptions about the nature of the data
- can build models with datasets containing categorical and numerical data (convert to categories?)

Issues:

- outputs must be categorical
- no multiple outputs (unless the categories of the outputs can be paired up)
- algorithms are unstable, as minor variations in the training data can cause different results (a change in an internal node will propagate down the tree)
- trees built from numerical datasets may be complex, as the attribute splits are binary (less than or greater than a value)
  - categorical attributes, by contrast, may have multiple children, potentially flattening the tree