Data Mining: A Closer Look

last updated 6/24/13

Knowledge representation -- the output of data mining: an overview

independent variables<-->Input variables to the data mining process (time is always independent; you may designate others based on how the data was collected)

dependent variables<-->Output variables in the DM process

Hierarchy of strategies

I. Supervised Learning (uses differentiated independent and dependent variables)

  1. Classify
  2. Estimate/predict

II. Unsupervised Clustering (does not differentiate between independent and dependent)

III. Market Basket Analysis


Outputs

Most results from data mining can be visualized!

Linear regression model for supervised learning prediction

The output is represented as an equation: a sum of weighted input values plus a constant.

There are many tools and software applications to do complex linear, least squares regression.

In Weka load the cpu.arff data file from the Weka library to demonstrate linear regression.
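
As a sketch of what the learned model looks like (independent of Weka), the output is just a weighted sum of the input attributes plus a constant. The attribute names below come from cpu.arff, but the weights and intercept are made-up placeholders, not what Weka would actually learn.

# Sketch only: evaluating a linear model as a weighted sum of inputs.
# MYCT, MMAX, CACH are cpu.arff attributes; the weights and intercept
# are made-up placeholders, not coefficients learned by Weka.
weights = {"MYCT": 0.05, "MMAX": 0.006, "CACH": 0.6}
intercept = -55.0

def predict(instance):
    """Predicted numeric output for one instance (a dict of attribute values)."""
    return intercept + sum(w * instance[name] for name, w in weights.items())

print(predict({"MYCT": 125, "MMAX": 6000, "CACH": 256}))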

 


Binary classification




Trees for classification and prediction

Weka demonstration
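
The Weka demo itself is not reproduced here; as a rough stand-in, the sketch below builds a small classification tree in Python with scikit-learn (an assumed tool, not part of the original demo) from a few made-up cardiology-style instances.

# Sketch: a decision tree learned from a tiny, made-up data set
# (scikit-learn stands in for the Weka tree learners here).
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: MaximumHeartRate, Age (values are made up)
X = [[190, 29], [172, 40], [110, 63], [130, 58], [178, 35], [120, 61]]
y = ["Healthy", "Healthy", "Sick", "Sick", "Healthy", "Sick"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["MaximumHeartRate", "Age"]))
print(tree.predict([[150, 50]]))   # classify a new instance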


Rules

Representing structural patterns

Representation determines inference method

Understanding the output is the key to understanding the underlying learning methods

Different types of output for different learning problems (e.g. classification, regression, …)


Classification (supervised)

Classification rules are a popular alternative to decision trees.

Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree)

Consequent (conclusion): classes, set of classes, or probability distribution assigned by rule

Classification is the best understood of the DM strategies.

Results

All of these results deal with future behavior (e.g. predicting river levels) or with business decisions.


Estimation (supervised)

Purpose is to determine a value for an output attribute:

The output is a numerical value or probability, usually over some continuous range

The modeled output values are for immediate use and/or for comparison with current real-world values.

May use statistical linear regression modeling.

The estimate can typically be verified.

 


Prediction (supervised)

Similar to, or a variation of, estimation and classification, but with the intent of predicting further into the future.

The following table shows the sets of values that can be coded into the instances.

Cardiology Patient Data Types

Cardiology Domain, Most and least typical instances

Two rules generated:

IF 169 <= MaximumHeartRate <=202
THEN ConceptClass = Healthy

(Rule accuracy = 85.07%
Rule coverage = 34.55%)

IF Thal=Rev && ChestPainType=Asymptomatic
THEN ConceptClass = Sick

(Rule accuracy = 91.14%
Rule coverage = 52.17%)
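
The sketch below shows one way to compute rule accuracy and coverage, using the convention the numbers above imply: accuracy is correct matches divided by all instances matching the antecedent, and coverage is correct matches divided by all instances of the rule's class. The five patient records are made up.

# Sketch: accuracy and coverage of the rule
#   IF 169 <= MaximumHeartRate <= 202 THEN ConceptClass = Healthy
# The records are made up; real cardiology data gives the
# 85.07% / 34.55% figures quoted above.
patients = [
    {"MaximumHeartRate": 172, "ConceptClass": "Healthy"},
    {"MaximumHeartRate": 185, "ConceptClass": "Healthy"},
    {"MaximumHeartRate": 178, "ConceptClass": "Sick"},
    {"MaximumHeartRate": 120, "ConceptClass": "Healthy"},
    {"MaximumHeartRate": 131, "ConceptClass": "Sick"},
]

matches = [p for p in patients if 169 <= p["MaximumHeartRate"] <= 202]
correct = [p for p in matches if p["ConceptClass"] == "Healthy"]
healthy = [p for p in patients if p["ConceptClass"] == "Healthy"]

print("rule accuracy =", len(correct) / len(matches))   # 2/3
print("rule coverage =", len(correct) / len(healthy))   # 2/3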

We can identify relationships but not necessarily causality.

This is always a caution one must consider: don't conclude causality prematurely.


Unsupervised Clustering

These are situations where there is no dependent variable to guide the process.

Instead we use clusters to build a knowledge structure. This may be used to precede the development of a supervised model.

Example uses:

Outliers--instances where values of certain attributes appear to be extreme. When you plot these instances, they visibly stand apart.

Outliers may be of interest so that they can be excluded (statisticians often remove them as exceptions that unduly skew statistical calculations). However, outliers may themselves be instances of significant interest; credit card fraud, for example, typically shows up as outlier data.
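
A minimal sketch of one way to flag such outliers: measure how far an attribute value lies from the mean in standard deviations (a z-score test). The purchase amounts and the cutoff of 3 standard deviations are only illustrative.

# Sketch: flag instances whose attribute value lies far from the mean.
# Data and the 3-standard-deviation cutoff are illustrative only.
from statistics import mean, stdev

amounts = [42.0, 38.5, 51.0, 47.2, 44.9, 39.8, 43.1,
           48.7, 41.5, 46.0, 50.2, 37.9, 4999.0]   # made-up purchase amounts
mu, sigma = mean(amounts), stdev(amounts)

outliers = [x for x in amounts if abs(x - mu) / sigma > 3]
print(outliers)   # the 4999.0 purchase stands apart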


Market Basket Analysis

Try to find interesting relationships among retail products.

Typically the goal is to generate association rules: which products are "related"?

The results are used to decide how products are placed on shelves or web pages, which items are adjacent, etc.


Rule Generation (supervised)

Example credit card promotional database:

Credit card promotional database

Rules generated for hypotheses are stated in terms of current rather than predicted behavior. Rules may be later used for classification or prediction, however.

IF sex=Female && 19<=Age<=43
THEN LifeInsurancePromo=Yes
(Accuracy=100%, coverage=66.7%)

IF sex=Male && IncomeRange=40-50K
THEN LifeInsurancePromo=No
(Accuracy=100%, coverage=50%)

IF CreditCardInsurance=Yes
THEN LifeInsurancePromo=Yes
(Accuracy=100%, coverage=33.3%)

IF WatchPromo=Yes && IncomeRange=40-50K
THEN LifeInsurancePromo=Yes
(Accuracy=100%, coverage=33.3%)


Association Rules

The purpose is to discover associations among the attributes, including relationships between output variables.

Inputs are generally discrete values: categories, or groupings if numeric.
The Boolean tests make evaluation simple; we are interested only in whether or not an instance matches an association, not in fitting a numeric model such as linear regression.

Problem: immense number of possible associations

Output needs to be restricted to show only the most predictive associations, that is, only those with high support and high confidence

Examples:

  1. IF Sex=Female && Age=over40 && CreditCardIns=No
    THEN LifeInsurancePromo = Yes

  2. IF Sex=Male && Age=over40 && CreditCardIns=No
    THEN LifeInsurancePromo = Yes

  3. IF Sex=Female && Age=over40
    THEN CreditCardIns=No && LifeInsurancePromo = Yes

You may come up with rules with little meaning.

This is an a priori approach--that is, you need to set the groupings ahead of time.

Support and confidence of a rule

Support: the number of instances the rule predicts correctly

Confidence: the number of correct predictions, as a proportion of all instances the rule applies to

Example (weather data): there are 4 cool days with normal humidity, so the rule IF temperature=cool THEN humidity=normal has support = 4 and confidence = 100%.

Normally, minimum support and confidence are pre-specified (e.g. 58 rules with support >= 2 and confidence >= 95% for the weather data).
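
A minimal sketch of counting support and confidence for one candidate rule over a handful of made-up weather-style instances (so the counts differ from the 4-instance example above).

# Sketch: support and confidence of the candidate rule
#   IF temperature=cool THEN humidity=normal
# support    = number of instances the rule predicts correctly
# confidence = correct predictions / instances the rule applies to
days = [
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "high"},
    {"temperature": "hot",  "humidity": "high"},
    {"temperature": "mild", "humidity": "normal"},
]

applies = [d for d in days if d["temperature"] == "cool"]
correct = [d for d in applies if d["humidity"] == "normal"]

print("support    =", len(correct))                  # 3
print("confidence =", len(correct) / len(applies))   # 0.75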


Neural Networks

This is an artificial intelligence technique in which a model is "learned" from data and then used for prediction.

Inputs are identified and must be numeric.

Neural Network Model

The neural network model mimics the interconnected neurons of the human brain. Nodes "fire" if their inputs meet certain thresholds. Determining these weights and thresholds is what requires a learning phase.

Phase 1 (learning): the network weights are adjusted using the training data until the computed outputs acceptably match the known output values.

Phase 2: use the trained network to compute outputs for new sets of inputs.

Note that making sense of the weights generated within the network is hard due to the interconnectedness of the network.

Neural Network Training Results

Close to 0 = no, close to 1 = yes. What defines "close"? And what do we do with values in the middle?
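
A minimal sketch of phase 2, the feed-forward computation, for a tiny network with one hidden layer and sigmoid activations. The weights are made-up placeholders (a real network learns them in phase 1), and the 0.5 cutoff is only one possible way to decide what counts as "close" to 0 or 1.

# Sketch: feed-forward pass through a tiny "trained" network.
# Weights are made-up placeholders; phase 1 (learning) would set them.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

hidden_weights = [[0.8, -0.4, 0.3],    # weights into hidden node 1
                  [-0.6, 0.9, 0.2]]    # weights into hidden node 2
output_weights = [1.5, -1.2]           # weights from hidden nodes to the output node

def feed_forward(inputs):
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

out = feed_forward([0.9, 0.1, 0.4])    # numeric inputs, typically scaled to [0, 1]
print(out, "-> yes" if out > 0.5 else "-> no")   # one possible cutoff for "close"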


Statistical Regression

The linear regression model is commonly used to predict numeric outputs from a set of numeric inputs, using a linear equation whose coefficients (weights) are determined from a training set of data.

There is one equation per output variable. The equation is a sum of terms, each an attribute multiplied by a coefficient (which can be positive or negative). In more complex models an attribute may be squared or cubed. The regression process determines the coefficients from the data set.

Example equation: lifeInsurancePromo = 0.591(CreditCardIns) - 0.546 (sex) +0.773

Excel or Minitab can be used to generate these equations and analyze how well they fit the data.
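
Besides Excel or Minitab, the coefficients can be obtained with any least-squares solver. The sketch below uses NumPy on a few made-up 0/1-coded rows, so the fitted numbers will not reproduce the 0.591 / -0.546 / 0.773 equation above.

# Sketch: least-squares fit of LifeInsurancePromo from CreditCardIns and sex.
# The 0/1-coded rows are made up, so the coefficients will differ from the
# equation quoted above.
import numpy as np

# Columns: CreditCardIns, sex (1 = male, 0 = female), constant term
X = np.array([[1, 0, 1],
              [0, 0, 1],
              [0, 1, 1],
              [1, 1, 1],
              [0, 1, 1],
              [1, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 1, 0, 1], dtype=float)   # LifeInsurancePromo (1 = yes)

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)   # [weight for CreditCardIns, weight for sex, intercept]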

We will look at linear regression more later.

 

 


Unsupervised Clustering

This is an exploratory data mining technique. You don't designate independent/input versus dependent/output variables. You are looking for relationships through clusterings of instances.

Apply some measure of similarity to divide instances into disjoint partitions.

What drives inclusion into a partition may only become apparent when the resulting clusters are examined.

Unsupervised clustering of the credit card db

Follow up with other techniques to explore relationships among attributes that may not have been considered before.
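
A minimal sketch of similarity-based partitioning with k-means (one common clustering method; scikit-learn, the two numeric attributes, and the choice of two clusters are all assumptions for illustration).

# Sketch: k-means partitions instances by similarity (Euclidean distance).
# Attributes (Age, Income in $1000s) and k = 2 are illustrative choices.
from sklearn.cluster import KMeans

X = [[27, 35], [30, 38], [24, 30],     # one apparent grouping
     [55, 90], [60, 95], [58, 88]]     # another apparent grouping

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # which partition each instance fell into
print(km.cluster_centers_)   # attribute profile of each cluster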


Evaluating Performance

  1. What are the benefits?
  2. What is the return on investment (ROI)?
  3. How do we interpret the results of the mining?
  4. What confidence level do we have on the results?

#1 and #2 are answered in the context of the business model of the organization you work for.

Evaluating supervised learner models

Classification correctness -- calculated by presenting previously unseen data, in the form of a test set, to the model (i.e., hold some of the data back and use it for correctness testing).

Test set model accuracy can be summarized in a confusion matrix

A Three-class Confusion Matrix

                Computed A                Computed B                Computed C
  Actual A      A correctly classified    A classified as B         A classified as C
  Actual B      B classified as A         B correctly classified    B classified as C
  Actual C      C classified as A         C classified as B         C correctly classified

Examples with 10% error rate:

Confusion Matrix, Model A

                   Computed Accept    Computed Reject
  Actual Accept          600                 25
  Actual Reject           75                300

Confusion Matrix, Model B

                   Computed Accept    Computed Reject
  Actual Accept          500                 75
  Actual Reject           25                400
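
A short sketch of how the 10% error rate is computed: correct classifications sit on the diagonal of the confusion matrix, and everything off the diagonal is an error.

# Sketch: error rate from a confusion matrix (rows = actual, columns = computed).
model_a = [[600, 25],    # actual Accept: 600 computed Accept, 25 computed Reject
           [75, 300]]    # actual Reject: 75 computed Accept, 300 computed Reject

total   = sum(sum(row) for row in model_a)
correct = sum(model_a[i][i] for i in range(len(model_a)))

print("accuracy   =", correct / total)       # 900 / 1000 = 0.9
print("error rate =", 1 - correct / total)   # 0.1, the 10% quoted above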

 

Numeric output evaluation

Again evaluate with test data: compute the mean squared error, the average of the squared differences between the computed and actual output values. Minimizing the sum of squared errors is at the heart of linear regression.
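
A minimal sketch of the mean squared error on a test set; the actual and computed values are made up.

# Sketch: mean squared error over a test set (values are made up).
actual   = [10.0, 12.5, 9.0, 14.0]
computed = [11.0, 12.0, 8.5, 15.5]

mse = sum((a - c) ** 2 for a, c in zip(actual, computed)) / len(actual)
print(mse)   # (1 + 0.25 + 0.25 + 2.25) / 4 = 0.9375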

 


Lift

Lift measures the change in the concentration of a desired class C when a sample selected by the data mining model is compared with the entire population. The more the model biases the sample toward class C, the better.

Lift = P(class C | sample) / P(class C | population)

A lift chart shows the performance of a data mining model as a function of sample size

Lift Chart

Confusion Matrix, Model Y

                   Computed Accept    Computed Reject
  Actual Accept          450                550
  Actual Reject        19550              79450

Lift(Model Y) = (450/20000) / (1000/100000) = 2.25
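
The same computation as a sketch, reading the numbers straight out of the Model Y confusion matrix above.

# Sketch: lift for Model Y (rows = actual Accept/Reject, columns = computed).
model_y = [[450, 550],
           [19550, 79450]]

sample_size     = model_y[0][0] + model_y[1][0]      # everyone computed Accept = 20000
accepts_sampled = model_y[0][0]                      # actual accepts in that sample = 450
population      = sum(sum(row) for row in model_y)   # 100000
accepts_total   = sum(model_y[0])                    # all actual accepts = 1000

lift = (accepts_sampled / sample_size) / (accepts_total / population)
print(lift)   # 2.25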

 


Unsupervised Model Evaluation

More difficult to do.

One approach is supervised evaluation: treat each cluster as a class, build a supervised model from the clustered instances, and summarize its test-set performance in a confusion matrix.

Typically two-thirds of the data are used for training and the remaining third as test data.
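
A minimal sketch of that split (shuffling first so the split is not biased by the order of the instances).

# Sketch: hold out one third of the data as a test set.
import random

instances = list(range(30))         # stand-in for the real instances
random.shuffle(instances)           # avoid order bias before splitting

cut      = (2 * len(instances)) // 3
training = instances[:cut]          # two-thirds used to build the model
testing  = instances[cut:]          # remaining third used to evaluate it
print(len(training), len(testing))  # 20 10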