Sequential Patterns


last updated 23-Jun-2021

Introduction to Sequences

Sequential patterns add order to the items: the data record not only which items occur together but also the order in which they occur.

Sequence of different transactions by a customer at an online store (includes multiple items at the same time):

< {Digital Camera, iPad} {memory card} {headphone, iPad cover} >

Sequence of initiating events causing the nuclear accident at Three Mile Island (each event at a different time):

< {clogged resin} {outlet valve closure} {loss of feedwater}
{condenser polisher outlet valve shut} {booster pumps trip}
{main water pump trips} {main turbine trips} {reactor pressure increases} >

Sequence of books checked out at a library by a patron:

<{Fellowship of the Ring} {The Two Towers} {Return of the King}>

In telecommunications alarm logs,

Inverter_Problem: (Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)

In point-of-sale transaction sequences,

Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)

Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)

Sequence Data

Patterns that could be picked out: {2} -> {1} and {1,7,8}. The former is a rule (an ordering between elements); the latter is a simple association (an itemset within a single element).

Formal Definition of a Sequence

A sequence is an ordered list of elements: s = < e1 e2 e3 … >

Where each element contains a collection of events (items): ei = {i1, i2, …, ik}

Length of a sequence, |s|, is given by the number of elements in the sequence

A k-sequence is a sequence that contains k events (items)
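
The difference between the length |s| and the k of a k-sequence is easy to mix up, so here is a minimal sketch. It assumes a sequence is stored as a Python list of sets; the helper names `length` and `num_items` are chosen for illustration only.

```python
# A sequence is modelled as a list of elements; each element is a set of items.
# This is the online-store example from the introduction.
s = [{"Digital Camera", "iPad"}, {"memory card"}, {"headphone", "iPad cover"}]

def length(seq):
    """|s|: the number of elements in the sequence."""
    return len(seq)

def num_items(seq):
    """k: the total number of events (items), so seq is a k-sequence."""
    return sum(len(element) for element in seq)

print(length(s))     # 3 elements
print(num_items(s))  # 5 items, so s is a 5-sequence
```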


A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers
i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin (each ai is a set of items, so the relation is subset, not membership).

Illustrative Example:
s:  b1 b2 b3 b4 b5
t:      a1 a2      a3

t is a subsequence of s if a1 ⊆ b2, a2 ⊆ b3, a3 ⊆ b5.

Data sequence              Subsequence          Contained?

< {2,4} {3,5,6} {8} >      < {2} {8} >          Yes
< {1,2} {3,4} >            < {1} {2} >          No
< {2,4} {2,4} {2,5} >      < {2} {4} >          Yes
< {2,4} {2,5} {4,5} >      < {2} {4} {5} >      No
< {2,4} {2,5} {4,5} >      < {2} {5} {5} >      Yes
< {2,4} {2,5} {4,5} >      < {2, 4, 5} >        No

The support of a subsequence w is defined as the fraction of data sequences that contain w

A sequential pattern is a frequent subsequence (i.e., a subsequence where support ≥ minsup)
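
As a rough illustration of support and the minsup test, the sketch below counts how many data sequences contain a pattern. The four-sequence database and the threshold `minsup = 0.5` are assumptions made up for this example, and `contains` and `support` are illustrative names.

```python
def contains(data_seq, sub_seq):
    """Greedy order-preserving containment test (no timing constraints)."""
    pos = 0
    for a in sub_seq:
        while pos < len(data_seq) and not a <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Fraction of data sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database) / len(database)

# Hypothetical database of four data sequences.
database = [
    [{2, 4}, {3, 5, 6}, {8}],
    [{1, 2}, {3, 4}],
    [{2, 4}, {2, 4}, {2, 5}],
    [{2, 4}, {2, 5}, {4, 5}],
]
minsup = 0.5  # assumed threshold

for pattern in ([{2}, {4}], [{2}, {8}]):
    sup = support(pattern, database)
    print(pattern, sup, "frequent" if sup >= minsup else "not frequent")
```

Here < {2} {4} > is contained in three of the four sequences (support 0.75) and so is a sequential pattern at this threshold, while < {2} {8} > appears in only one (support 0.25).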

Finding subsequences




Sequence Data vs. Market-basket Data

Extracting Sequential Patterns


Generalized Sequential Pattern (GSP)

Step 1:
Make the first pass over the sequence database D to yield all the 1-element frequent sequences

Step 2:
Repeat until no new frequent sequences are found

Candidate Generation

Base case (k=2):

Merging two frequent 1-sequences <{i1}> and <{i2}> will produce the following candidate 2-sequences:
<{i1} {i1}>, <{i1} {i2}>, <{i2} {i2}>, <{i2} {i1}> and <{i1 i2}>.
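
The base case can be sketched as follows. This is only an illustration, assuming candidates are represented as tuples of frozensets; `candidate_2_sequences` is a hypothetical helper name.

```python
from itertools import combinations

def candidate_2_sequences(frequent_items):
    """Generate all candidate 2-sequences from the frequent 1-sequences."""
    items = sorted(frequent_items)
    candidates = []
    # Two separate elements: every ordered pair, including an item with itself.
    for a in items:
        for b in items:
            candidates.append((frozenset([a]), frozenset([b])))
    # One element holding two distinct items (order inside an element is irrelevant).
    for a, b in combinations(items, 2):
        candidates.append((frozenset([a, b]),))
    return candidates

cands = candidate_2_sequences(["i1", "i2"])
print(len(cands))  # 5 candidates for the pair i1, i2
for c in cands:
    print([sorted(element) for element in c])
```

For n frequent items this yields n*n two-element candidates plus n*(n-1)/2 single-element candidates.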

General case (k>2):

A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing an event from the first element in w1 is the same as the subsequence obtained by removing an event from the last element in w2
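
A sketch of the merge test for the general case follows. Two things here are assumptions rather than statements from these notes: items inside an element are kept in sorted order, so "the first" and "the last" event are well defined, and the way the merged candidate is assembled follows the usual GSP description (the last event of w2 joins the last element of w1 if it shared an element in w2, otherwise it is appended as a new element). The names `drop_first`, `drop_last` and `merge` are illustrative.

```python
# A sequence is a tuple of elements; each element is a sorted tuple of items.

def drop_first(seq):
    """Remove the first event of the first element (dropping the element if emptied)."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """Remove the last event of the last element (dropping the element if emptied)."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def merge(w1, w2):
    """Return the candidate k-sequence, or None if w1 and w2 cannot be merged."""
    if drop_first(w1) != drop_last(w2):
        return None
    last_item = w2[-1][-1]
    if len(w2[-1]) > 1:
        # The last event shared an element in w2: add it to w1's last element.
        return w1[:-1] + (w1[-1] + (last_item,),)
    # Otherwise the last event becomes a new element at the end of w1.
    return w1 + ((last_item,),)

print(merge(((1,), (2,), (3,)), ((2,), (3,), (4,))))
print(merge(((1,), (5,), (3,)), ((5,), (3, 4))))
print(merge(((1,), (2,), (3,)), ((1,), (2,), (5,))))
```

The first call gives < {1} {2} {3} {4} >, the second gives < {1} {5} {3,4} >, and the third returns None because the two reduced subsequences differ.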



Timing Constraints

Approach 1:

Approach 2:

Apriori Principle for Sequence Data

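The Apriori principle is undermined by a max-gap constraint: with max-gap in force, a sequence can be frequent even though one of its subsequences is not, because dropping an intermediate element can widen the gap between the remaining elements beyond the limit.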
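
To see how max-gap interacts with the Apriori principle, consider the sketch below. It assumes each element carries an integer timestamp and that max-gap bounds the time difference between consecutively matched elements; the three-sequence database is made up for illustration, and the function names are hypothetical.

```python
def contains_with_maxgap(data_seq, pattern, maxgap):
    """data_seq is a list of (time, items) pairs in increasing time order."""
    def match(p_idx, prev_time, start):
        if p_idx == len(pattern):
            return True
        for i in range(start, len(data_seq)):
            time, items = data_seq[i]
            if prev_time is not None and time - prev_time > maxgap:
                break  # later elements are even further away
            # Backtracking is needed: the earliest match is not always the right one.
            if pattern[p_idx] <= items and match(p_idx + 1, time, i + 1):
                return True
        return False
    return match(0, None, 0)

def support(pattern, database, maxgap):
    hits = sum(contains_with_maxgap(seq, pattern, maxgap) for seq in database)
    return hits / len(database)

# Hypothetical database with timestamps.
database = [
    [(1, {2}), (2, {3}), (3, {5})],
    [(1, {1, 2}), (2, {3}), (3, {5, 6})],
    [(1, {2}), (2, {5})],
]

sup_long = support([{2}, {3}, {5}], database, maxgap=1)
sup_short = support([{2}, {5}], database, maxgap=1)
print(sup_long, sup_short)
```

With maxgap = 1 the longer sequence < {2} {3} {5} > is supported by two of the three data sequences, while its subsequence < {2} {5} > is supported by only one, so support is no longer anti-monotone.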


Contiguous subsequences

s is a contiguous subsequence of w = < e1 e2 … ek >
if any of the following conditions hold:

  1. s is obtained from w by deleting an item from either e1 or ek
  2. s is obtained from w by deleting an item from any element ei that contains at least 2 items
  3. s is a contiguous subsequence of s’ and s’ is a contiguous subsequence of w (recursive definition)

Examples: s = < {1} {2} >
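
The recursive definition can be tested by repeatedly applying the deletions allowed by conditions 1 and 2 until s is reached or nothing smaller is left. The sketch below makes three assumptions: an element that loses its only item disappears from the sequence, a sequence counts trivially as a contiguous subsequence of itself, and the sequences w used in the demonstration are chosen here for illustration. The names `reductions` and `is_contiguous_subsequence` are hypothetical.

```python
def seq(*elements):
    """Build a hashable sequence: a tuple of frozensets."""
    return tuple(frozenset(e) for e in elements)

def reductions(w):
    """All sequences obtained from w by one deletion allowed by conditions 1 and 2."""
    out = set()
    last = len(w) - 1
    for idx, element in enumerate(w):
        # Condition 1: first or last element. Condition 2: element with >= 2 items.
        if idx in (0, last) or len(element) >= 2:
            for item in element:
                rest = element - {item}
                middle = (rest,) if rest else ()  # an emptied element disappears
                out.add(w[:idx] + middle + w[idx + 1:])
    return out

def is_contiguous_subsequence(s, w):
    """Condition 3: chain allowed deletions from w until s is reached."""
    target_size = sum(len(e) for e in s)
    frontier, seen = {w}, set()
    while frontier:
        current = frontier.pop()
        if current == s:
            return True
        seen.add(current)
        if sum(len(e) for e in current) > target_size:
            frontier |= reductions(current) - seen
    return False

s = seq({1}, {2})
for w in (seq({1}, {2, 3}), seq({1, 2}, {2}, {3}),
          seq({1}, {3}, {2}), seq({2}, {1}, {3}, {2})):
    print([sorted(e) for e in w], is_contiguous_subsequence(s, w))
```

The first two sequences admit s as a contiguous subsequence; the last two do not, because the single-item element {3} sits between 1 and 2 and can only be removed once it becomes the first or last element.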

Modified Candidate Pruning

Timing Constraints II

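This introduces a window size (ws): the maximum time difference allowed between events that are treated as belonging to the same element. It is another way of combining events: items from different transactions can match a single pattern element, provided the transactions fall within ws of each other.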
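
One way to picture the window size is as a relaxed element match: a pattern element may be covered by the union of several data elements whose timestamps lie within ws of each other. The sketch below is an assumption-laden illustration (integer timestamps, a hypothetical function name, made-up data), not a full GSP matcher.

```python
def element_matches_within_window(timed_elements, pattern_element, ws):
    """timed_elements is a list of (time, items) pairs in increasing time order.

    Returns True if some run of elements spanning at most ws time units
    together contains every item of pattern_element.
    """
    for i, (start, _) in enumerate(timed_elements):
        merged = set()
        for time, items in timed_elements[i:]:
            if time - start > ws:
                break  # outside the window that opens at `start`
            merged |= items
            if pattern_element <= merged:
                return True
    return False

# Hypothetical data sequence with timestamps 1, 2 and 5.
data = [(1, {1}), (2, {2}), (5, {3})]

print(element_matches_within_window(data, {1, 2}, ws=0))  # False
print(element_matches_within_window(data, {1, 2}, ws=1))  # True
print(element_matches_within_window(data, {2, 3}, ws=1))  # False
```

With ws = 0 only items from the same transaction can form an element, which is the ordinary containment rule.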


Other formulation

In some domains, we may have only one very long time series


Goal is to find frequent sequences of events in the time series. This problem is also known as frequent episode mining.

General Support Counting Schemes

COBJ -- One occurrence per object

CWIN -- One occurrence per sliding window

CMINWIN -- Number of minimal windows of occurrence (eliminates duplicate counting as in CWIN)

CDIST_O -- Distinct occurrences with possibility of event-timestamp overlap

CDIST -- Distinct occurrences with no event-timestamp overlap allowed.
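
To make the difference between the first two schemes concrete, here is a rough sketch of COBJ and CWIN. It rests on several assumptions that are not fixed by these notes: timestamps are integers, a window of width w covers the closed interval [t, t + w], windows slide one time unit at a time, and only windows lying inside the object's timeline are counted. The data and the function names are made up for illustration; the remaining schemes are not covered.

```python
def occurs(timed_seq, pattern):
    """Order-preserving containment over a list of (time, items) pairs."""
    pos = 0
    for wanted in pattern:
        while pos < len(timed_seq) and not wanted <= timed_seq[pos][1]:
            pos += 1
        if pos == len(timed_seq):
            return False
        pos += 1
    return True

def cobj(objects, pattern):
    """COBJ: each object contributes at most one occurrence."""
    return sum(occurs(seq, pattern) for seq in objects)

def cwin(objects, pattern, width):
    """CWIN: count the sliding windows (per object) in which the pattern occurs."""
    count = 0
    for seq in objects:
        if not seq:
            continue
        first, last = seq[0][0], seq[-1][0]
        for start in range(first, max(first, last - width) + 1):
            window = [(t, items) for t, items in seq if start <= t <= start + width]
            count += occurs(window, pattern)
    return count

# Hypothetical objects, each a timeline of (time, items) pairs.
objects = [
    [(1, {"p"}), (2, {"q"}), (3, {"p"}), (4, {"q"})],
    [(1, {"q"}), (2, {"p"})],
]
pattern = [{"p"}, {"q"}]

print(cobj(objects, pattern))           # 1
print(cwin(objects, pattern, width=1))  # 2
```

The first object contains the pattern twice but counts once under COBJ, whereas CWIN sees it in the windows [1, 2] and [3, 4].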