Advanced Assoc. Rule Topics


last updated 23-Jun-2021

Considerations for Continuous and Categorical Attributes

How can association analysis be applied to attributes that are not asymmetric binary variables?

Example rule: {Gender=Male, Age ∈ [21,30)} → {No of hours online ≥ 10}


Handling Categorical attributes

example (Internet usage)

Example rule: {Level of Education=Graduate, Online Banking=Yes} → {Privacy Concerns = Yes}

The table above is transformed, as below, into one binary column per category.

Weka does this automatically. RapidMiner requires the use of the "Nominal to Binominal" operator.
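A minimal sketch of the same transformation in plain Python, assuming each record is a dict of attribute to category. The three survey rows are hypothetical; only the attribute names come from the example rule above.

```python
def binarize(records):
    """Turn each record into a set of 'attribute=value' items,
    i.e. one asymmetric binary item per category."""
    return [{f"{attr}={value}" for attr, value in rec.items()} for rec in records]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Hypothetical rows, for illustration only.
records = [
    {"Level of Education": "Graduate", "Online Banking": "Yes", "Privacy Concerns": "Yes"},
    {"Level of Education": "College", "Online Banking": "No", "Privacy Concerns": "No"},
    {"Level of Education": "Graduate", "Online Banking": "Yes", "Privacy Concerns": "Yes"},
]
transactions = binarize(records)

antecedent = {"Level of Education=Graduate", "Online Banking=Yes"}
rule_items = antecedent | {"Privacy Concerns=Yes"}
rule_support = support(rule_items, transactions)
rule_confidence = rule_support / support(antecedent, transactions)
print(rule_support, rule_confidence)
```

Once every category is its own item, the usual frequent-itemset machinery applies unchanged.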

When an attribute has a large number of categories, you might aggregate the low-support values into a single category, as below:
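One way to do this aggregation, sketched in Python. The function name `merge_rare`, the "Other" label, the threshold, and the sample values are all assumptions made for illustration.

```python
from collections import Counter

def merge_rare(values, min_support, other="Other"):
    """Replace every category whose relative frequency is below
    min_support with a single catch-all label."""
    counts = Counter(values)
    n = len(values)
    return [v if counts[v] / n >= min_support else other for v in values]

# Hypothetical attribute values: two common categories and two rare ones.
states = ["NY"] * 5 + ["CA"] * 3 + ["VT", "ME"]
merged = merge_rare(states, min_support=0.2)
print(Counter(merged))
```

The rare categories are counted together, so the combined "Other" item has a better chance of reaching the support threshold.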

Distribution of attribute values

can be highly skewed; rarely are the categories evenly distributed.

Computational Complexity


Handling Continuous Attributes

Different methods:

Many different kinds of rules can be produced:


Discretization Methods

Unsupervised:

Numeric to nominal conversion

Original data: 2 3 4 5 6 7 9 10 11 15 16 20

1. Equal-width binning: data split by range (20-2=18, 18/3=6)
   Bin 1: [2,8) = {2,3,4,5,6,7}
   Bin 2: [8,14) = {9,10,11}
   Bin 3: [14,20] = {15,16,20}

2. Equal-depth binning: bins have the same number of values (12/3=4)
   Bin 1: {2,3,4,5}
   Bin 2: {6,7,9,10}
   Bin 3: {11,15,16,20}

3. Cluster-based binning: find natural gaps in the data (some variation)
   Bin 1: {2,3,4,5,6,7}
   Bin 2: {9,10,11}
   Bin 3: {15,16,20}
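The first two methods can be sketched in a few lines of Python. The function names are illustrative, and the cluster-based method is left out because its result depends on the clustering algorithm chosen.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would index bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding the same number of values.
    Any remainder goes into the last bin."""
    ordered = sorted(values)
    size = len(ordered) // k
    return [ordered[i * size:(i + 1) * size if i < k - 1 else None]
            for i in range(k)]

data = [2, 3, 4, 5, 6, 7, 9, 10, 11, 15, 16, 20]
width_bins = equal_width_bins(data, 3)
depth_bins = equal_depth_bins(data, 3)
print(width_bins)
print(depth_bins)
```

On the data above, both functions reproduce the bins listed for methods 1 and 2.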

 

Supervised discretization


Discretization Issues

Interval width

Interval too wide (e.g., Bin size= 30)

Interval too narrow (e.g., Bin size = 2)

Approach of considering all possible intervals

Execution time

If the range is partitioned into k intervals, there are O(k²) new items

If an interval [a,b) is frequent, then all intervals that subsume [a,b) must also be frequent
E.g.: if {Age ∈ [21,25), Chat Online=Yes} is frequent, then {Age ∈ [10,50), Chat Online=Yes} is also frequent

Improve efficiency: Use maximum support to avoid intervals that are too wide
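A small sketch of why the item count grows quadratically and how a maximum support threshold trims the widest intervals. The bin edges, ages, and the threshold 0.7 are hypothetical values chosen for illustration.

```python
def candidate_intervals(edges, values, max_support):
    """Enumerate every interval [edges[i], edges[j]) with i < j.
    With k base intervals there are k*(k+1)/2 of them.
    Keep only those whose support does not exceed max_support."""
    n = len(values)
    total, kept = 0, []
    for i in range(len(edges) - 1):
        for j in range(i + 1, len(edges)):
            total += 1
            lo, hi = edges[i], edges[j]
            sup = sum(lo <= v < hi for v in values) / n
            if sup <= max_support:
                kept.append((lo, hi, sup))
    return total, kept

edges = [10, 20, 30, 40, 50]                      # k = 4 base intervals
ages = [12, 18, 21, 22, 24, 25, 33, 35, 41, 47]   # hypothetical ages
total, kept = candidate_intervals(edges, ages, max_support=0.7)
print(total, len(kept))
```

Widening an interval can only add records, so its support never decreases. Very wide intervals such as [10,50) are therefore always frequent but carry little information, and the maximum support filter discards them.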

 

Redundant rules

R1: {Age ∈ [18,20), No of hours online ∈ [10,12)} → {Chat Online=Yes}
R2: {Age ∈ [18,23), No of hours online ∈ [10,20)} → {Chat Online=Yes}

If both rules have the same support and confidence, prune the more specific rule (R1)

 


Statistics-based Association Rule Methods

Example: {Income > 100K, Online Banking=Yes} → Age: μ=34

The rule consequent is a continuous variable, characterized by its statistics: mean, median, standard deviation, etc.

Approach:

Apply statistical test to determine interestingness of the rule
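The notes do not say which test to apply. One common choice is a two-sample Z-test that compares the mean of the records covered by the rule with the mean of the remaining records. The sketch below assumes that test; the ages and the margin `delta` are hypothetical.

```python
import math
from statistics import mean, variance

def z_statistic(covered, rest, delta=0.0):
    """Z statistic for the hypothesis that the covered group's mean exceeds
    the rest's mean by more than delta (sample variances used)."""
    standard_error = math.sqrt(variance(covered) / len(covered)
                               + variance(rest) / len(rest))
    return (mean(covered) - mean(rest) - delta) / standard_error

# Tiny hypothetical samples for readability; a Z-test presumes larger ones.
covered_ages = [30, 32, 34, 36, 38]   # records matching the antecedent
other_ages = [22, 24, 26, 28, 30]     # records not matching it
z = z_statistic(covered_ages, other_ages, delta=5)
print(z)
```

The rule would be reported as interesting only if z exceeds the critical value for the chosen significance level (about 1.64 for a one-sided test at 95% confidence).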

 


Word-Document associations

What measure can we use to evaluate the level of association of words (Wx) across the documents (Dx)?

E.g. Word1 and Word2 appear to occur together frequently.

Discretization is not effective here since ranges of words make no sense.

Normalize to frequency rates across the documents

Redefine Support as
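One redefinition that works with normalized frequencies, used by the Min-Apriori approach, takes the support of a word set to be the sum, over all documents, of the smallest normalized frequency among its words. The sketch below assumes that definition; the count table is hypothetical.

```python
def normalize(counts):
    """Scale each word's counts so that they sum to 1 across all documents."""
    words = {w for doc in counts.values() for w in doc}
    totals = {w: sum(doc.get(w, 0) for doc in counts.values()) for w in words}
    return {d: {w: (doc.get(w, 0) / totals[w] if totals[w] else 0.0)
                for w in words}
            for d, doc in counts.items()}

def support(words, freq):
    """Sum over documents of the minimum normalized frequency in the set."""
    return sum(min(doc[w] for w in words) for doc in freq.values())

# Hypothetical word counts per document.
counts = {
    "D1": {"W1": 2, "W2": 2, "W3": 0},
    "D2": {"W1": 0, "W2": 0, "W3": 2},
    "D3": {"W1": 2, "W2": 1, "W3": 0},
    "D4": {"W1": 0, "W2": 1, "W3": 2},
}
freq = normalize(counts)
sup_w1 = support({"W1"}, freq)
sup_w1_w2 = support({"W1", "W2"}, freq)
sup_w1_w3 = support({"W1", "W3"}, freq)
print(sup_w1, sup_w1_w2, sup_w1_w3)
```

With this definition every single word has support 1, and adding a word to a set can never raise its support, so Apriori-style pruning still applies.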

 


Concept Hierarchies--Multilevel association rules

Items have a natural organization into groupings. Individual specific items may lack sufficient support even when the group they belong to has it.

The hierarchy is not necessarily a tree; in general it is a directed acyclic graph (DAG). The examples below show only tree-shaped hierarchies.

Rules at lower levels may not have enough support to appear in any frequent itemsets

Rules at lower levels of the hierarchy are overly specific

Conversely, rules at higher levels of the hierarchy may be too generic.

How do support and confidence vary as we traverse the concept hierarchy?

If X is the parent item for both X1 and X2, then σ(X) ≤ σ(X1) + σ(X2)

If σ(X1 & Y1) ≥ minsup, and X is the parent of X1 and Y is the parent of Y1, then σ(X & Y1), σ(X1 & Y), and σ(X & Y) all meet the threshold as well

If conf(X1 → Y1) ≥ minconf, then conf(X1 → Y) ≥ minconf

Approach 1:

Extend current association rule formulation by augmenting each transaction with higher level items

Original Transaction: {skim milk, wheat bread}
Augmented Transaction: {skim milk, wheat bread, milk, bread, food}
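The augmentation step can be sketched as follows. The `PARENTS` mapping is an assumed encoding of the hierarchy implied by the example; each item maps to a list of parents so that a DAG is also handled.

```python
# Assumed hierarchy, inferred from the example transaction above.
PARENTS = {
    "skim milk": ["milk"],
    "wheat bread": ["bread"],
    "milk": ["food"],
    "bread": ["food"],
}

def augment(transaction, parents):
    """Return the transaction plus every ancestor of each of its items."""
    items = set(transaction)
    stack = list(transaction)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in items:
                items.add(parent)
                stack.append(parent)
    return items

augmented = augment({"skim milk", "wheat bread"}, PARENTS)
print(sorted(augmented))
```

After augmentation, a standard frequent-itemset algorithm can be run directly on the enlarged transactions.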

Issues:

Approach 2:

1. Generate frequent patterns at highest level first

2. Generate frequent patterns at the next highest level, and so on
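A simplified sketch of the level-by-level idea, restricted to single items rather than full itemsets. It assumes transactions already contain their higher-level items, and it descends only into children of items found frequent. The hierarchy and transactions are hypothetical ("2% milk" and "white bread" are invented items).

```python
def frequent_items_top_down(transactions, children, roots, minsup):
    """Test items level by level, starting from the roots.
    Children of an infrequent item are never examined, since an item
    cannot have more support than its parent."""
    n = len(transactions)
    frequent, level = {}, list(roots)
    while level:
        next_level = []
        for item in level:
            sup = sum(item in t for t in transactions) / n
            if sup >= minsup:
                frequent[item] = sup
                next_level.extend(children.get(item, ()))
        level = next_level
    return frequent

children = {
    "food": ["milk", "bread"],
    "milk": ["skim milk", "2% milk"],
    "bread": ["wheat bread", "white bread"],
}
transactions = [
    {"skim milk", "milk", "food"},
    {"2% milk", "milk", "food"},
    {"wheat bread", "bread", "food"},
    {"skim milk", "milk", "wheat bread", "bread", "food"},
]
frequent = frequent_items_top_down(transactions, children, ["food"], minsup=0.5)
print(frequent)
```

A full implementation would generate multi-item patterns at each level, which requires one pass over the data per level.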

Issues:


Sequential patterns and association rules

Sequential patterns: account for the order/timing of items purchased or events.

 


Subgraph patterns