last updated 23-Jun-2021

## Considerations for Continuous and Categorical Attributes

How can association analysis be applied to attributes that are not asymmetric binary variables, such as categorical and continuous attributes?

Example rule: {Gender=Male, Age ∈ [21,30)} → {No of hours online ≥ 10}

## Handling Categorical Attributes

### Example (Internet usage)

Example rule: {Level of Education=Graduate, Online Banking=Yes} → {Privacy Concerns = Yes}

The table above is transformed, as shown below, into one binary column per category. Weka does this automatically. RapidMiner requires the "Nominal to Binominal" operator.

When an attribute has a large number of categories, you might aggregate the low-support values into a single category, as below:

### Distribution of attribute values

The distribution of attribute values can be highly skewed; the categories are rarely evenly distributed.

• Example: 85% of survey participants own a computer at home
• Most records have Computer at home = Yes, so rules involving this item may not be very useful or interesting
• Computation becomes expensive because many frequent itemsets involve the binary item (Computer at home = Yes)
• Potential solutions:
  • discard the highly frequent items
  • use alternative measures such as h-confidence [Hui Xiong, Dept CS, Univ Minn. http://datamining.rutgers.edu/talk/clique.pdf]

### Computational Complexity

• Binarizing the data increases the number of items,
• but the width of the "transactions" remains the same as the number of original (non-binarized) attributes.
• Binarizing produces more frequent itemsets, but the maximum size of a frequent itemset is limited to the number of original attributes.
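The binarization step and its effect on item count and transaction width can be sketched in plain Python. This is a minimal illustration, not what Weka or RapidMiner do internally: the `binarize` helper, the `min_share` aggregation rule, the `Other` label, and the survey records are all assumptions made up for the example.

```python
from collections import Counter

def binarize(records, min_share=0.0, other_label="Other"):
    """Turn each {attribute: value} record into a set of 'attribute=value' items.

    Values whose share of the records is below min_share are merged into a
    single 'attribute=Other' item (an illustrative aggregation rule).
    """
    n = len(records)
    counts = Counter((attr, val) for rec in records for attr, val in rec.items())
    transactions = []
    for rec in records:
        items = set()
        for attr, val in rec.items():
            if counts[(attr, val)] / n < min_share:
                val = other_label
            items.add(f"{attr}={val}")
        transactions.append(items)
    return transactions

# Hypothetical survey records (not taken from the original table).
records = [
    {"Level of Education": "Graduate", "Online Banking": "Yes", "Privacy Concerns": "Yes"},
    {"Level of Education": "College", "Online Banking": "No", "Privacy Concerns": "No"},
    {"Level of Education": "Graduate", "Online Banking": "Yes", "Privacy Concerns": "Yes"},
    {"Level of Education": "High School", "Online Banking": "No", "Privacy Concerns": "Yes"},
]

transactions = binarize(records)
n_items = len(set().union(*transactions))   # distinct binary items after binarizing
widths = {len(t) for t in transactions}     # items per transaction

# Merge rare education levels into one item.
aggregated = binarize(records, min_share=0.5)
```

Here three original attributes become seven binary items, yet every transaction still holds exactly three items, one per original attribute.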

## Handling Continuous Attributes

Different methods:

• Discretization-based
• Statistics-based
• Non-discretization based: minApriori

Many different kinds of rules can be produced:

• {Age ∈ [21,30), No of hours online ∈ [10,20)}→ {Chat Online =Yes}
• {Age ∈ [21,30), Chat Online = Yes} → {No of hours online ∈ [4,14)}

### Unsupervised discretization

• Equal-width binning
• Equal-depth binning
• Cluster-based
Original data: 2 3 4 5 6 7 9 10 11 15 16 20

| Process | Notes | Bin 1 | Bin 2 | Bin 3 |
|---|---|---|---|---|
| 1. Equal-width binning (data split by range) | 20-2=18, 18/3=6 | [2,8) = {2,3,4,5,6,7} | [8,14) = {9,10,11} | [14,20] = {15,16,20} |
| 2. Equal-depth binning (bins have the same number of values) | 12/3=4 | {2,3,4,5} | {6,7,9,10} | {11,15,16,20} |
| Some variation | | {2,3,4,5,6,7} | {9,10,11} | {15,16,20} |
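The equal-width and equal-depth results in the binning example above can be reproduced with a short sketch. The function names are hypothetical, and the equal-width version assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_bins(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k          # assumes hi > lo
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would fall at index k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding (nearly) the same number of values."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), k)
    bins, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)  # spread any remainder over the first bins
        bins.append(ordered[start:end])
        start = end
    return bins

data = [2, 3, 4, 5, 6, 7, 9, 10, 11, 15, 16, 20]
by_width = equal_width_bins(data, 3)
by_depth = equal_depth_bins(data, 3)
```

With this data, `by_width` groups the values by the ranges [2,8), [8,14), and [14,20], while `by_depth` puts four values in each bin.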

### Supervised discretization

## Discretization Issues

### Interval width

Interval too wide (e.g., Bin size = 30)

• May merge several disparate patterns: Patterns A and B are merged together
• May lose some of the interesting patterns: Pattern C may not have enough confidence

Interval too narrow (e.g., Bin size = 2)

• Pattern A is broken up into two smaller patterns: Can recover the pattern by merging adjacent subpatterns
• Pattern B is broken up into smaller patterns: Cannot recover the pattern by merging adjacent subpatterns
• Some windows may not meet support threshold

Approach of considering all possible intervals:

• Number of intervals = k
• Total number of adjacent intervals = k(k-1)/2

### Execution time

If the range is partitioned into k intervals, there are O(k²) new items.

If an interval [a,b) is frequent, then all intervals that subsume [a,b) must also be frequent
E.g.: if {Age ∈ [21,25), Chat Online=Yes} is frequent, then {Age ∈ [10,50), Chat Online=Yes} is also frequent

Improve efficiency: Use maximum support to avoid intervals that are too wide
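To make the counting and the maximum-support idea concrete, here is a small sketch that enumerates every interval obtainable from k base intervals, counts its support, and keeps only those between a minimum and a maximum support count. The ages, bin edges, and thresholds are hypothetical values chosen for illustration.

```python
from itertools import combinations

def candidate_intervals(edges):
    """All intervals [edges[i], edges[j]) with i < j: the base bins plus every merge of adjacent bins."""
    return [(edges[i], edges[j]) for i, j in combinations(range(len(edges)), 2)]

def support_count(values, interval):
    """Number of values that fall inside the half-open interval [lo, hi)."""
    lo, hi = interval
    return sum(lo <= v < hi for v in values)

ages = [12, 15, 19, 21, 22, 24, 27, 33, 38, 45]   # hypothetical data
edges = [10, 20, 30, 40, 50]                      # k = 4 base intervals
k = len(edges) - 1

candidates = candidate_intervals(edges)
merged = [iv for iv in candidates if edges.index(iv[1]) - edges.index(iv[0]) > 1]

# Keep intervals that are frequent (>= minsup) but not too wide (<= maxsup).
minsup, maxsup = 3, 7
kept = [iv for iv in candidates if minsup <= support_count(ages, iv) <= maxsup]
```

With k = 4 there are 4 base intervals and 4·3/2 = 6 merged ones, so 10 candidate items in total. The widest intervals, such as [10,50), cover nearly every record, and the maximum support bound removes them.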

### Redundant rules

R1: {Age ∈ [18,20), Age ∈ [10,12)} → {Chat Online=Yes}
R2: {Age ∈ [18,23), Age∈ [10,20)} → {Chat Online=Yes}

If both rules have the same support and confidence, prune the more specific rule (R1)

## Statistics-based Association Rule Methods

Example: {Income > 100K, Online Banking=Yes} → Age: μ=34

The rule consequent is a continuous variable, characterized by its statistics: mean, median, standard deviation, etc.

Approach:

• Withhold the target attribute from the rest of the data
• Extract frequent itemsets from the rest of the attributes
• Binarize the continuous attributes (except for the target attribute)
• For each frequent itemset, compute the corresponding descriptive statistics of the target attribute
• Frequent itemset becomes a rule by introducing the target variable as rule consequent

Apply a statistical test to determine the interestingness of the rule.
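The approach can be sketched as follows. The notes do not say which statistical test to use, so this example assumes a two-sample Z statistic that compares the target's mean among records covered by the itemset with its mean among the remaining records. The records, item names, and helper functions are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical records: binarized items plus the withheld target attribute (Age).
records = [
    {"items": {"Income>100K", "Online Banking=Yes"}, "Age": 32},
    {"items": {"Income>100K", "Online Banking=Yes"}, "Age": 34},
    {"items": {"Income>100K", "Online Banking=Yes"}, "Age": 36},
    {"items": {"Income>100K", "Online Banking=No"}, "Age": 22},
    {"items": {"Online Banking=Yes"}, "Age": 24},
    {"items": {"Online Banking=No"}, "Age": 26},
    {"items": {"Online Banking=Yes"}, "Age": 28},
]

def split_target(records, itemset, target):
    """Target values for records covered by the itemset, and for the remaining records."""
    covered = [r[target] for r in records if itemset <= r["items"]]
    rest = [r[target] for r in records if not (itemset <= r["items"])]
    return covered, rest

def z_statistic(covered, rest):
    """Two-sample Z statistic for the difference in means (one possible test, assumed here)."""
    se = sqrt(stdev(covered) ** 2 / len(covered) + stdev(rest) ** 2 / len(rest))
    return (mean(covered) - mean(rest)) / se

itemset = {"Income>100K", "Online Banking=Yes"}
covered, rest = split_target(records, itemset, "Age")
rule_mean = mean(covered)        # consequent of the rule: Age with mean rule_mean
z = z_statistic(covered, rest)   # a large |z| suggests the rule is interesting
```

For this toy data the rule reads {Income > 100K, Online Banking=Yes} → Age: μ=34, and the Z statistic is about 5.2, well beyond common significance cut-offs.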

## Word-Document associations

What measure can we use to evaluate the level of association of words (Wx) across the documents (Dx)?

E.g., Word1 and Word2 appear to be common.

Discretization is not effective here since ranges of words make no sense.
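A non-discretized alternative can be sketched as below. It assumes the min-Apriori style definition of support: each word's counts are normalized to sum to 1 across documents, and the support of a word set is the sum over documents of the smallest normalized value among its words. The count matrix and function names are hypothetical.

```python
def normalize_columns(matrix):
    """Scale each word's counts so that they sum to 1 across all documents."""
    n_words = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_words)]
    return [
        [row[j] / totals[j] if totals[j] else 0.0 for j in range(n_words)]
        for row in matrix
    ]

def min_support(norm, word_idxs):
    """Sum over documents of the minimum normalized frequency among the chosen words."""
    return sum(min(row[j] for j in word_idxs) for row in norm)

# Hypothetical word counts: rows are documents D1..D5, columns are words W1..W5.
counts = [
    [2, 2, 0, 0, 1],
    [0, 0, 1, 2, 2],
    [2, 3, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 1, 0, 2],
]

norm = normalize_columns(counts)
sup_w1 = min_support(norm, [0])
sup_w1_w2 = min_support(norm, [0, 1])
sup_w1_w2_w3 = min_support(norm, [0, 1, 2])
```

Under this definition a single word always has support 1, and support can only shrink as words are added (here roughly 1.0, 0.9, and 0.17), so the usual Apriori pruning still applies.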

Normalize the word counts to frequency rates across the documents, then redefine support in terms of the normalized values.

## Concept Hierarchies--Multilevel association rules

Items have a natural organization of groupings. More specific items may not have sufficient support but as a group there may be support.

The hierarchy is not necessarily a tree; in general it is a directed acyclic graph (DAG). The examples below don't show the hierarchy as a DAG.

Rules at lower levels may not have enough support to appear in any frequent itemsets.

Rules at lower levels of the hierarchy are overly specific:

• e.g., skim milk → white bread, skim milk → wheat bread, etc.
• all are indicative of an association between milk and bread

Conversely, rules at higher levels of the hierarchy may be too generic.

How do support and confidence vary as we traverse the concept hierarchy?

If X is the parent item for both X1 and X2, then σ(X) ≤ σ(X1) + σ(X2)

If σ(X1 & Y1) ≥ minsup, X is the parent of X1, and Y is the parent of Y1, then all of the following itemsets also meet the threshold:

• σ(X & Y1) ≥ minsup
• σ(X1 & Y) ≥ minsup
• σ(X & Y) ≥ minsup

If conf(X1 → Y1) ≥ minconf, then conf(X1 → Y) ≥ minconf
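The confidence property follows from the support property: every transaction that contains Y1 also counts toward its parent Y, so replacing Y1 with Y in the consequent can only raise the numerator. A short derivation, writing σ(A ∪ B) for what the notes write as σ(A & B):

$$
\sigma(X_1 \cup Y) \ge \sigma(X_1 \cup Y_1)
\;\Longrightarrow\;
\mathrm{conf}(X_1 \to Y) = \frac{\sigma(X_1 \cup Y)}{\sigma(X_1)}
\ge \frac{\sigma(X_1 \cup Y_1)}{\sigma(X_1)}
= \mathrm{conf}(X_1 \to Y_1) \ge \mathit{minconf}
$$

The same argument does not carry over to generalizing the antecedent: moving from X1 to X enlarges both the numerator and the denominator, so conf(X → Y1) may be lower than conf(X1 → Y1).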

### Approach 1:

Extend current association rule formulation by augmenting each transaction with higher level items

Original Transaction: {skim milk, wheat bread}

Issues:

• Items that reside at higher levels have much higher support counts
• If the support threshold is low, there are too many frequent patterns involving items from the higher levels
• Increased dimensionality of the data
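A minimal sketch of Approach 1, assuming a small hypothetical hierarchy and made-up transactions. Parents are stored as lists so that an item may have several parents, which allows a DAG rather than only a tree.

```python
# Hypothetical concept hierarchy: child -> list of parents.
PARENTS = {
    "skim milk": ["milk"],
    "2% milk": ["milk"],
    "wheat bread": ["bread"],
    "white bread": ["bread"],
}

def ancestors(item, parents):
    """All higher-level items reachable from item by following parent links."""
    found, stack = set(), [item]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in found:
                found.add(p)
                stack.append(p)
    return found

def augment(transaction, parents):
    """Add every ancestor of every item to the transaction."""
    out = set(transaction)
    for item in transaction:
        out |= ancestors(item, parents)
    return out

def support_count(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions)

# Hypothetical transactions at the lowest level of the hierarchy.
raw = [
    {"skim milk", "wheat bread"},
    {"skim milk", "white bread"},
    {"2% milk", "wheat bread"},
    {"2% milk"},
    {"skim milk", "2% milk", "white bread"},
]
augmented = [augment(t, PARENTS) for t in raw]

s_milk = support_count(augmented, {"milk"})
s_skim = support_count(augmented, {"skim milk"})
s_two = support_count(augmented, {"2% milk"})
s_specific = support_count(augmented, {"skim milk", "wheat bread"})
s_general = support_count(augmented, {"milk", "bread"})
```

The first transaction becomes {skim milk, wheat bread, milk, bread}. In this toy data σ(milk) = 5 while σ(skim milk) + σ(2% milk) = 6, and {milk, bread} has support 4 against 1 for {skim milk, wheat bread}. That illustrates both the support inequality and why higher-level items dominate the frequent patterns.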

### Approach 2:

1. Generate frequent patterns at highest level first

2. Generate frequent patterns at the next highest level, and so on

Issues:

• I/O requirements will increase dramatically because we need to perform more passes over the data
• May miss some potentially interesting cross-level association patterns

## Sequential patterns and association rules

Sequential patterns: account for the order/timing of items purchased or events.

## Subgraph patterns