# Rule-based Classifier

last updated 23-Jun-2021

## Introduction

We want to classify records by using a collection of simple “if…then…” rules.

Rule notation: (Condition) → y

where

• Condition is a conjunction of attribute tests
• y is the class label
• The LHS is the rule antecedent, or condition
• The RHS is the rule consequent

### Examples of classification rules:

• (Blood Type=Warm) & (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) & (Refund=Yes) → Evade=No

R1: (Give Birth = no) & (Can Fly = yes) → Birds

R2: (Give Birth = no) & (Live in Water = yes) → Fishes

R3: (Give Birth = yes) & (Blood Type = warm) → Mammals

R4: (Give Birth = no) & (Can Fly = no) → Reptiles

R5: (Live in Water = sometimes) → Amphibians

## Application of Rule-Based Classifier

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.

Rule R1 above covers a hawk => Birds

Rule R3 covers the grizzly bear => Mammals

A lemur triggers rule R3, so it is classified => Mammals

A turtle triggers both R4 and R5

A dogfish shark matches none of the rules
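The covering relation can be sketched in a few lines of Python. The rule encoding and the attribute values for the hawk and turtle below are illustrative assumptions, not part of the original dataset:

```python
# A rule is (antecedent, class); an antecedent maps attribute -> required value.
RULES = [
    ({"give_birth": "no", "can_fly": "yes"}, "Birds"),          # R1
    ({"give_birth": "no", "live_in_water": "yes"}, "Fishes"),   # R2
    ({"give_birth": "yes", "blood_type": "warm"}, "Mammals"),   # R3
    ({"give_birth": "no", "can_fly": "no"}, "Reptiles"),        # R4
    ({"live_in_water": "sometimes"}, "Amphibians"),             # R5
]

def covers(antecedent, instance):
    """True if every attribute test in the antecedent is satisfied."""
    return all(instance.get(attr) == val for attr, val in antecedent.items())

hawk = {"give_birth": "no", "can_fly": "yes", "live_in_water": "no",
        "blood_type": "warm"}
turtle = {"give_birth": "no", "can_fly": "no", "live_in_water": "sometimes",
          "blood_type": "cold"}

print([cls for ant, cls in RULES if covers(ant, hawk)])    # only R1 fires
print([cls for ant, cls in RULES if covers(ant, turtle)])  # R4 and R5 both fire
```

Note that the turtle is covered by two rules and a dogfish shark by none, which is exactly the ambiguity the sections on ordered rule sets and default classes resolve.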

## Measures of Coverage and Accuracy

Coverage of a rule:

• Fraction of all records that satisfy the antecedent of a rule
• Count(instances with antecedent) / Count(training set)
• Example: for the rule (Status = 'Single') -> No over a 10-record training set, 4 records satisfy the antecedent, so coverage = 4/10 = 40%

Accuracy of a rule:

• Fraction of records that satisfy the antecedent that also satisfy the consequent of a rule
• Count (instances with antecedent AND consequent) / Count(instances with antecedent)
• Example: of the 4 records that satisfy (Status = 'Single'), 2 also have class No, so accuracy = 2/4 = 50%
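Both measures reduce to simple counting. A minimal sketch, using a hypothetical 10-record set constructed to reproduce the 4/10 and 2/4 figures above:

```python
# Hypothetical training set: 4 Single records (2 of class No), 6 others.
records = (
    [{"status": "Single", "cls": "No"}] * 2 +   # satisfy antecedent and consequent
    [{"status": "Single", "cls": "Yes"}] * 2 +  # satisfy antecedent only
    [{"status": "Married", "cls": "No"}] * 6    # do not satisfy antecedent
)

# Rule: (Status = 'Single') -> No
covered = [r for r in records if r["status"] == "Single"]
coverage = len(covered) / len(records)                      # 4/10
accuracy = sum(r["cls"] == "No" for r in covered) / len(covered)  # 2/4
print(coverage, accuracy)
```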

## Characteristics of Rule Sets

Mutually exclusive rules

• Classifier contains mutually exclusive rules if the rules are independent of each other
• Every record is covered by at most one rule

Exhaustive rules

• Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
• Each record is covered by at least one rule

Handling rules that are not mutually exclusive

• A record may trigger more than one rule
• Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

Handling rules that are not exhaustive

• A record may not trigger any rules
• Solution?
• Use a default class

## Ordered Rule Set

Rules are rank ordered according to their priority

• An ordered rule set is known as a decision list

When a test record is presented to the classifier

• It is assigned to the class label of the highest ranked rule it has triggered (first encountered)
• If none of the rules fired, it is assigned to the default class

Example from above

1. R1: (Give Birth = no) & (Can Fly = yes) → Birds
2. R2: (Give Birth = no) & (Live in Water = yes) → Fishes
3. R3: (Give Birth = yes) & (Blood Type = warm) → Mammals
4. R4: (Give Birth = no) & (Can Fly = no) → Reptiles
5. R5: (Live in Water = sometimes) → Amphibians

A turtle triggers both R4 and R5, but by rule order the conclusion is => Reptiles

We did not define a default rule.
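A decision list is straightforward to implement: scan the ordered rules and return the class of the first one that fires. The rule encoding repeats the running example; the "Unknown" default is an assumption, since no default rule was defined above:

```python
# Ordered rule set (decision list): the first matching rule wins.
RULES = [
    ({"give_birth": "no", "can_fly": "yes"}, "Birds"),          # R1
    ({"give_birth": "no", "live_in_water": "yes"}, "Fishes"),   # R2
    ({"give_birth": "yes", "blood_type": "warm"}, "Mammals"),   # R3
    ({"give_birth": "no", "can_fly": "no"}, "Reptiles"),        # R4
    ({"live_in_water": "sometimes"}, "Amphibians"),             # R5
]

def classify(rules, instance, default="Unknown"):
    """Return the class of the highest-ranked rule the instance triggers."""
    for antecedent, cls in rules:
        if all(instance.get(a) == v for a, v in antecedent.items()):
            return cls
    return default  # no rule fired: fall back to the default class

turtle = {"give_birth": "no", "can_fly": "no", "live_in_water": "sometimes"}
print(classify(RULES, turtle))  # R4 outranks R5, so: Reptiles
```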

### Rule ordering schemes

Rule-based ordering : Individual rules are ranked based on their quality

Class-based ordering: Rules that belong to the same class appear together, ordered by class sizes

• R1: Accuracy = 100%, coverage = 30%
• R2: Accuracy = 100%, coverage = 10%
• R3: Accuracy = 100%, coverage = 30%
• R4: Accuracy = 100%, coverage = 30%

In this example the ordering criterion is not clear.

## Building Classification Rules

Direct Method:

• Extract rules directly from data
• Examples: RIPPER, CN2, Holte’s 1R

Indirect Method:

• Extract rules from other classification models (e.g. decision trees, neural networks, etc.)
• Examples: C4.5rules

## Direct Method: Sequential Covering

1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion is met

More formally

```
Sequential-Covering(Target_attribute, Attributes, Examples, Threshold)

    Learned_rules <-- { }
    Rule <-- Learn-One-Rule(Target_attribute, Attributes, Examples)
    while Performance(Rule, Examples) > Threshold:
        Learned_rules <-- Learned_rules + Rule
        Examples <-- Examples - {examples correctly classified by Rule}
        Rule <-- Learn-One-Rule(Target_attribute, Attributes, Examples)
    Learned_rules <-- sort Learned_rules according to Performance over Examples
    return Learned_rules
```

However, because Sequential Covering does not backtrack, the algorithm is not guaranteed to find the smallest or best set of rules,
so the Learn-One-Rule function must be very effective.
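The loop above can be made concrete with a toy runnable sketch. The Learn-One-Rule here is a deliberately simple greedy version that repeatedly adds the single attribute=value test with the best accuracy (breaking ties by coverage); the dataset and all names are illustrative assumptions:

```python
def matches(ant, x):
    """True if instance x satisfies every test in antecedent ant."""
    return all(x.get(a) == v for a, v in ant.items())

def learn_one_rule(examples, target):
    """Greedily grow a conjunction until it covers only the target class."""
    ant = {}
    while True:
        cov = [(x, y) for x, y in examples if matches(ant, x)]
        if cov and all(y == target for _, y in cov):
            return ant  # rule is pure on the remaining examples
        tests = sorted({(a, v) for x, _ in cov
                        for a, v in x.items() if a not in ant})
        if not tests:
            return ant  # nothing left to add
        def score(t):
            a, v = t
            c = [y for x, y in cov if x.get(a) == v]
            return (sum(y == target for y in c) / len(c) if c else 0.0, len(c))
        a, v = max(tests, key=score)  # best accuracy, then best coverage
        ant[a] = v

def sequential_covering(examples, target):
    """Grow a rule, remove the records it covers, repeat."""
    rules, remaining = [], list(examples)
    while any(y == target for _, y in remaining):
        ant = learn_one_rule(remaining, target)
        rules.append((ant, target))
        remaining = [(x, y) for x, y in remaining if not matches(ant, x)]
    return rules

data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rain", "windy": "yes"}, "stay"),
    ({"outlook": "rain", "windy": "no"}, "play"),
]
print(sequential_covering(data, "play"))
```

Note how each learned rule strips away the records it covers, so the next rule is forced to explain a different region of the data; this is the instance-elimination step discussed next.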

In the accompanying diagram (not reproduced here), the rectangular boundaries represent the simple conditions in the antecedent; as you can see, they sometimes enclose misclassified instances.

## Instance Elimination

Why do we need to eliminate instances?

• Otherwise, the next rule would be identical to the previous rule

Why do we remove positive instances?

• Ensure that the next rule is different

Why do we remove negative instances?

• Prevent underestimating the accuracy of subsequent rules
• Compare rules R2 and R3 in the diagram (not reproduced here)

## Rule Growing

Top-down -- General-to-specific

In the general-to-specific example (figure not shown), a conjunct covering 3 cases, all classified as 'No', is added; that conjunct is chosen because of its coverage and accuracy.

Bottom-up -- Specific-to-general

In the specific-to-general example (figure not shown), two rules differ only in the income value (85K vs. 90K); since that conjunct does not distinguish the classes, it can be dropped and the rules merged into a simpler one.

### Rule Evaluation

How do we determine the best next rule?

FOIL’s Information Gain

[ FOIL: First Order Inductive Learner – an early rule-based learning algorithm]

R0: {} => class (initial rule)
R1: {A} => class (rule after adding a conjunct, i.e. one condition in the antecedent)

Information Gain(R0, R1) = t × [ log(p1 / (p1 + n1)) − log(p0 / (p0 + n0)) ]

where

• t = number of positive instances covered by both R0 and R1
• p0 = number of positive instances covered by R0
• n0 = number of negative instances covered by R0
• p1 = number of positive instances covered by R1
• n1 = number of negative instances covered by R1

Gain(R0,R1) is similar to the entropy gain calculations used in decision trees.
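The formula is easy to compute directly. In the sketch below, base-2 logarithms are used (any fixed base only rescales the gain), and since R1 is a specialization of R0, every positive instance R1 covers is also covered by R0, so t = p1. The example counts are illustrative assumptions:

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending rule R0 to rule R1.

    p/n = positive/negative instances covered by each rule; t = positive
    instances covered by both, which for a specialization is simply p1.
    """
    t = p1
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# e.g. R0 covers 12 positives and 8 negatives; adding a conjunct yields
# R1 covering 10 positives and only 2 negatives -> a positive gain.
gain = foil_gain(12, 8, 10, 2)
print(gain)
```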

## Direct Method: RIPPER

Repeated Incremental Pruning to Produce Error Reduction (Weka->Classifier -> Rules -> JRip)

For a 2-class problem, choose one class as the positive class and the other as the negative class.  Typically the class with fewer cases is chosen as positive.

• Learn rules for positive class
• Negative class becomes the default class

Generalize for multi-class problems

• Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
• Learn the rule set for smallest class first, treat the rest as negative class
• Repeat with next smallest class as positive class
• Thus the largest class will become the default class
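The class ordering reduces to sorting class labels by frequency. A small sketch with made-up class counts:

```python
from collections import Counter

# Hypothetical training labels; counts are illustrative assumptions.
labels = ["Mammals"] * 9 + ["Birds"] * 5 + ["Fishes"] * 4 + ["Reptiles"] * 2

# Increasing prevalence: learn rules for the rarest class first.
order = [cls for cls, _ in sorted(Counter(labels).items(),
                                  key=lambda kv: kv[1])]
print(order)      # rule-learning order, rarest class first
print(order[-1])  # the largest class becomes the default
```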

### Growing a rule:

• Start from empty rule
• Repeat: Add conjuncts as long as they improve FOIL’s information gain
• Stop when rule no longer covers negative examples
• We can get a rather extensive rule such as ABCD -> y
• Prune the rule immediately using incremental reduced error pruning
• Measure for pruning: v = (p-n)/(p+n)
• p: number of positive examples covered by the rule in the validation set
• n: number of negative examples covered by the rule in the validation set
• Pruning method: delete any final sequence of conditions that maximizes v
• Example: if the grown rule is ABCD -> y, check to prune D then CD, then BCD
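The pruning step can be sketched as follows. Validation-set counts are given as inputs; here `pn_after_drop[k]` holds the hypothetical (p, n) counts when the last k conditions are deleted (k = 0 keeps the full rule), and all the numbers are illustrative assumptions:

```python
def v_metric(p, n):
    """RIPPER's pruning metric on the validation set."""
    return (p - n) / (p + n)

def prune_rule(conjuncts, pn_after_drop):
    """Delete the final sequence of conditions that maximizes v.

    conjuncts: ordered conditions, e.g. ["A", "B", "C", "D"]
    pn_after_drop[k]: (p, n) on the validation set with the last k
    conditions removed; k = 0 is the unpruned rule.
    """
    best_k = max(range(len(pn_after_drop)),
                 key=lambda k: v_metric(*pn_after_drop[k]))
    return conjuncts[:len(conjuncts) - best_k]

# Full rule ABCD covers (20, 5); dropping D gives (28, 6); dropping CD (30, 12).
# v is highest after dropping only D, so the pruned rule is ABC -> y.
print(prune_rule(["A", "B", "C", "D"], [(20, 5), (28, 6), (30, 12)]))
```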

### Building a Rule Set:

Uses the sequential covering algorithm

• Finds the best rule that covers the current set of positive examples
• Eliminate both positive and negative examples covered by the rule

Each time a rule is added to the rule set, compute the new description length

• Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far (default setting for d=64 bits) [this is not clear from the text]
• Alternatively stop when the error rate exceeds 50%

Optimize the rule set:

• For each rule r in the rule set R
• Consider 2 alternative rules:
• Replacement rule (r*): grow new rule from scratch
• Revised rule(r′): add conjuncts to extend the rule r
• Compare the rule set for r against the rule set for r* and r′
• Choose rule set that minimizes MDL principle (minimum description length-- a measure of model complexity)
• Repeat rule generation and rule optimization for the remaining positive examples

## Indirect Methods

Convert from a decision tree: each root-to-leaf path becomes one rule, with the tests along the path as the antecedent and the leaf label as the consequent.

## C4.5rules

(Weka: Classifiers -> Rules -> PART)

Extract rules from an unpruned decision tree

For each rule, r: A → y,

• consider an alternative rule r′: A′→ y where A′ is obtained by removing one of the conjuncts in A
• Compare the pessimistic error rate for r against all alternatives r′
• Prune if one of the alternative rules has lower pessimistic error rate
• Repeat until we can no longer improve generalization error

Instead of ordering the rules, order subsets of rules (class ordering)

• Each subset is a collection of rules with the same rule consequent (class)
• Compute description length of each subset
• Description length = L(error) + g *L(model)
• g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5)
• Similar to the generalization error calculation of a decision tree

## Example Comparison

### C4.5rules

(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:

(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
() → Mammals

Comparing the confusion matrices (not reproduced here) for C4.5rules versus RIPPER on a 20-instance test set:

C4.5 accuracy = 16/20 = 80%

RIPPER accuracy = 12/20 = 60%

Rule-based classifiers have characteristics quite similar to decision trees:

• As highly expressive as decision trees
• Easy to interpret
• Performance comparable to decision trees
• Can handle redundant attributes

Better suited for handling imbalanced classes

Harder to handle missing values in the test set
