DM 352 Syllabus | DM 552 Syllabus

last updated 23-Jun-2021

We are introducing or adding order to the items.

- There may be items grouped at the same time
- Actual time is not significant except for a partial ordering

Sequence of different transactions by a customer at an online store (includes multiple items at the same time):

< {Digital Camera, iPad} {memory card} {headphone, iPad cover} >

Sequence of initiating events causing the nuclear accident at 3-mile Island (each event at a different time):

(http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm)

< {clogged resin} {outlet valve closure} {loss of feedwater}

{condenser polisher outlet valve shut} {booster pumps trip}

{main waterpump trips} {main turbine trips} {reactor pressure increases}>

Sequence of books checked out at a library by a patron:

<{Fellowship of the Ring} {The Two Towers} {Return of the King}>

In telecommunications alarm logs,

Inverter_Problem:

(Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)

In point-of-sale transaction sequences,

Computer Bookstore:

(Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)Athletic Apparel Store:

(Shoes) (Racket, Racketball) --> (Sports_Jacket)

Patterns that could be picked out: {2} -> {1} and {1,7,8} former as a rule and the latter as a simple association/itemset

A sequence is an ordered list of elements:** s = < e _{1} e_{2} e_{3} … >**

Where each element contains a collection of events (items):

e_{i}= {i_{1}, i_{2}, …, i_{k}}

**Length** of a sequence, |s|, is given by the number of elements in the sequence

A * k-sequence* is a sequence that contains k events (items)

A sequence <a

_{1}a_{2}… a_{n}> is contained in another sequence <b_{1}b_{2}… b_{m}> (m ≥ n) if there exist integers

i_{1}< i_{2}< … < i_{n}such that a_{1}∈ b_{i1}, a_{2}∈ b_{i2}, …, a_{n}∈b_{in}Illustrative Example:

s: b_{1}b_{2}b_{3}b_{4}b_{5}

t: a_{1}a_{2}a_{3}t is a subsequence of s if a

_{1}∈b_{2}, a_{2}∈ b_{3}, a_{3}∈ b_{5}.

Data sequence |
Subsequence |
Contain? |
---|---|---|

< {2,4} {3,5,6} {8} > |
< {2} {8} > |
Yes |

< {1,2} {3,4} > |
< {1} {2} > |
No |

< {2,4} {2,4} {2,5} > |
< {2} {4} > |
Yes |

<{2,4} {2,5}, {4,5}> |
< {2} {4} {5} > |
No |

<{2,4} {2,5}, {4,5}> |
< {2} {5} {5} > |
Yes |

<{2,4} {2,5}, {4,5}> |
< {2, 4, 5} > |
No |

The * support of a subsequence*(w) is defined as the fraction of data sequences that contain w

A * sequential pattern *is a frequent subsequence (i.e., a subsequence where support ≥

Given:

- a database of sequences
- a user-specified minimum support threshold, minsup
Task:

- Find all subsequences with support ≥ minsup

Example

Step 1:

Make the first pass over the sequence database D to yield all the 1-element frequent sequences

Step 2:

Repeat until no new frequent sequences are found

- Candidate Generation:
- Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items

- Candidate Pruning:
- Prune candidate k-sequences that contain infrequent (k-1)-subsequences

- Support Counting:
- Make a new pass over the sequence database D to find the support for these candidate sequences

- Candidate Elimination:
- Eliminate candidate k-sequences whose actual support is less than minsup

Base case (k=2):

Merging two frequent 1-sequences <{i

_{1}}> and <{i_{2}}> will produce the following candidate 2-sequences:

<{i_{1}} {i_{1}}>, <{i_{1}} {i_{2}}>, <{i_{2}} {i_{2}}>, <{i_{2}} {i_{1}}> and <{i_{1}i_{2}}>.

General case (k>2):

A frequent (k-1)-sequence w

_{1}is merged with another frequent (k-1)-sequence w_{2}to produce a candidate k-sequence if the subsequence obtained by removing an event from the first element in w_{1}is the same as the subsequence obtained by removing an event from the last element in w_{2}

- The resulting candidate after merging is given by extending the sequence w
_{1}as follows-

- If the last element of w
_{2}has only one event, append it to w_{1}- Otherwise add the event from the last element of w
_{2}(which is absent in the last element of w_{1}) to the last element of w_{1}

Examples

- Merging w
_{1}=<{1 2 3} {4 6}> and w_{2}=<{2 3} {4 6} {5}> produces the candidate sequence < {1 2 3} {4 6} {5}> because the last element of w_{2}has only one event - Merging w
_{1}=<{1} {2 3} {4}> and w_{2}=<{2 3} {4 5}> produces the candidate sequence < {1} {2 3} {4 5}> because the last element in w_{2}has more than one event - Merging w
_{1}=<{1 2 3} > and w_{2}=<{2 3 4} > produces the candidate sequence < {1 2 3 4}> because the last element in w_{2}has more than one event - We do not have to merge the sequences
w
_{1}=<{1} {2 6} {4}> and w_{2}=<{1} {2} {4 5}> to produce the candidate < {1} {2 6} {4 5}> because if the latter is a viable candidate, then it can be obtained by merging w_{1}with < {2 6} {4 5}>

Approach 1:

- Mine sequential patterns without timing constraints
- Postprocess the discovered patterns

Approach 2:

- Modify GSP to directly prune candidates that violate timing constraints

is undermined with varied max-gap

s is a contiguous subsequence of
w = <e_{1}>< e_{2}>…< e_{k}>

if any of the following conditions hold:

- s is obtained from w by deleting an item from either e
_{1}or e_{k} - s is obtained from w by deleting an item from any element ei that contains at least 2 items
- s is a contiguous subsequence of s’ and s’ is a contiguous subsequence of w (recursive definition)

Examples: s = < {1} {2} >

- is a contiguous subsequence of

< {1} {2 3}>, < {1 2} {2} {3}>, and < {3 4} {1 2} {2 3} {4} > - but is not a contiguous subsequence of

< {1} {3} {2}> and < {2} {1} {3} {2}>

Modified Candidate Pruning

- Without maxgap constraint: A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent
- With maxgap constraint: A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent

This introduces a window size which is a time difference for an event across elements. Another combining of events option.

In some domains, we may have only one very long time series

Example:

- monitoring network traffic events for attacks
- monitoring telecommunication alarm signals

Goal is to find frequent sequences of events in the time series.
This problem is also known as **frequent episode mining**.

COBJ -- One occurrence per object

CWIN -- One occurrence per sliding window

CMINWIN -- Number of minimal windows of occurrence (eliminates duplicate counting as in CWIN)

CDIST_O -- Distinct occurrences with possibility of event-timestamp overlap

CDIST -- Distinct occurrences with no event-timestamp overlap allowed.