# Data Mining: A Closer Look

last updated 6/24/13

## Output of Knowledge Representation through data mining--overview

independent variables<-->Input variables to the data mining process (time is always independent; you may designate others based on how the data was collected).

dependent variables<-->Output variables in the DM process

### Hierarchy of strategies

I. Supervised Learning (uses differentiated independent and dependent variables)

1. Classify
2. Estimate/predict

II. Unsupervised Clustering (does not differentiate between independent and dependent)

## Outputs

### Linear Regression model for supervised learning prediction

• Inputs and outputs are all numeric

The output is represented as an equation as the sum of weighted input values

• The trick is to find good values for the weights in the equation
• PRP = 37.06 + 2.47*CACH
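As a rough illustration of "finding good values for the weights", the sketch below fits a one-input line by ordinary least squares. The `fit_line` helper and the `cach` values are hypothetical (they are not the real cpu.arff data); the PRP values are generated from the equation above so that the fit should recover its weights.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); both share the same 1/n factor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic inputs (assumption, not taken from cpu.arff).
cach = [0, 8, 16, 32, 64, 128]
# Outputs generated from the equation in the notes, so the fit is exact.
prp = [37.06 + 2.47 * c for c in cach]

intercept, slope = fit_line(cach, prp)
print(f"PRP = {intercept:.2f} + {slope:.2f}*CACH")
```

With real, noisy data the recovered weights would only approximate the underlying relationship.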

There are many tools and software applications to do complex linear, least squares regression.

In Weka load the cpu.arff data file from the Weka library to demonstrate linear regression.

• remove all attributes except CACH and class
• Click Choose under Classifier -> Functions -> LinearRegression
• Choose class as the output variable above the Start button
• Click Start
• The resulting equation is not exactly the same as the one above, but it is similar

## Binary classification

• A line separates the two classes using a decision boundary which defines where the decision changes from one class value to the other
• Predictions are made by plugging observed values of the attributes into the expression: predict one class if the output >= 0, and the other class if the output < 0
• The boundary becomes a high-dimensional plane (hyperplane) when there are multiple attributes
• Example: Separating setosa irises from versicolor irises. The equation of the boundary line is 2.0 - 0.5*PETAL-LENGTH - 0.8*PETAL-WIDTH = 0
• Not able to find this process in Weka
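The decision rule above can be sketched in a few lines. The notes do not say which class lies on which side of the boundary; the sketch assumes the non-negative side is setosa (setosa flowers have small petals, which makes the expression positive). The function names and the two sample measurements are illustrative only.

```python
def boundary_value(petal_length, petal_width):
    """Left-hand side of the boundary equation from the notes."""
    return 2.0 - 0.5 * petal_length - 0.8 * petal_width

def classify(petal_length, petal_width):
    # Assumption: output >= 0 means setosa, output < 0 means versicolor.
    if boundary_value(petal_length, petal_width) >= 0:
        return "setosa"
    return "versicolor"

# Illustrative measurements in centimetres (assumed, not from the notes).
small_flower = classify(1.4, 0.2)
large_flower = classify(4.7, 1.4)
print(small_flower, large_flower)
```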

## Trees for classification and prediction

• "Divide-and-conquer" approach produces tree
• Nodes involve testing a particular attribute
• Usually, attribute value is compared to constant
• Other possibilities:
  • Comparing values of two attributes
  • Using a function of one or more attributes
• Leaves assign classification, set of classifications, or probability distribution to instances
• Unknown instance is routed down the tree
• Regression tree is a “decision tree” where each leaf predicts a numeric quantity
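To make "routing an instance down the tree" concrete, here is a minimal sketch of a regression tree stored as nested dictionaries. The tree is hypothetical: the attribute names come from cpu.arff, but the thresholds and leaf values are made up and are not what M5P would produce.

```python
# Internal nodes compare one attribute to a constant; leaves hold a number.
# All thresholds and leaf predictions below are invented for illustration.
tree = {
    "attribute": "CACH", "threshold": 32,
    "low": {"leaf": 50.0},
    "high": {
        "attribute": "MMAX", "threshold": 16000,
        "low": {"leaf": 120.0},
        "high": {"leaf": 400.0},
    },
}

def predict(node, instance):
    """Route an instance from the root to a leaf and return the leaf value."""
    while "leaf" not in node:
        value = instance[node["attribute"]]
        node = node["low"] if value <= node["threshold"] else node["high"]
    return node["leaf"]

print(predict(tree, {"CACH": 64, "MMAX": 8000}))
```

A classification tree would look the same, with class labels (or a probability distribution) at the leaves instead of numbers.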

Weka demonstration

• Open cpu.arff from the Weka library
• Classify tab-> click Choose button -> Trees -> M5P
• Click Start
• Right click results list and choose Visualize Tree

## Rules

• Classification rules
• Association rules
• Rules with exceptions
• More expressive rules
• Instance-based representation
• Clusters

### Representing structural patterns

• Decision trees, rules, instance-based, …

Representation determines inference method

Understanding the output is the key to understanding the underlying learning methods

Different types of output for different learning problems (e.g. classification, regression, …)

## Classification (supervised)

Classification rules are a popular alternative to decision trees.

Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree)

• Tests are usually logically ANDed together (but may also be general logical expressions)

Consequent (conclusion): classes, set of classes, or probability distribution assigned by rule

• Individual rules are often logically ORed together
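One possible way to represent this structure in code is a list of rules, each holding a list of tests (the antecedent) and a conclusion (the consequent). The two rules below are taken from the credit card promotion example in these notes; the `classify` helper, the attribute spellings, and the sample instances are assumptions made for the sketch.

```python
# Each rule: (list of tests that are ANDed together, conclusion).
rules = [
    ([lambda r: r["Sex"] == "Female", lambda r: 19 <= r["Age"] <= 43], "Yes"),
    ([lambda r: r["Sex"] == "Male", lambda r: r["IncomeRange"] == "40-50K"], "No"),
]

def classify(instance, rules, default=None):
    """Return the conclusion of the first rule whose tests all succeed."""
    for tests, conclusion in rules:
        if all(test(instance) for test in tests):  # tests are ANDed
            return conclusion
    return default  # no rule fired

# Hypothetical instances.
print(classify({"Sex": "Female", "Age": 30, "IncomeRange": "30-40K"}, rules))
print(classify({"Sex": "Male", "Age": 50, "IncomeRange": "40-50K"}, rules))
```

Trying the rules one after another is a simple way of ORing them; it hides the question of what to do when several rules fire with different conclusions.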

This is the best understood of the DM strategies

• learning is supervised
• dependent variables are categorized into one of a set of discrete groups
• emphasis is on building models that are able to assign new instances to one of a set of well-defined classes.

Results

• Determine characteristic differences
• Determine profiles of people, events
• Determine risks

All these deal with future behavior (e.g. river levels) or business decisions

## Estimation (supervised)

Purpose is to determine a value for an output attribute:

The output is a numerical value or probability, usually over some continuous range

The modeled output values are intended for immediate use and/or comparison with the real world as it currently is.

May use statistical linear regression modeling.

The estimate can typically be verified.

## Prediction (supervised)

Similar to, or a variation of, estimation and classification, but with the intent of predicting further into the future.

• next week's Dow Jones Average
• car sales trend up or down in three months

The following table shows the sets of values that can be coded into the instances.

Two rules generated:

• IF 169 <= MaximumHeartRate <= 202 THEN ConceptClass = Healthy (Rule accuracy = 85.07%, Rule coverage = 34.55%)
• IF Thal=Rev && ChestPainType=Asymptomatic THEN ConceptClass = Sick (Rule accuracy = 91.14%, Rule coverage = 52.17%)

We can identify relationships but not necessarily causality.

This is always a caution one must consider: don't conclude causality prematurely.

## Unsupervised Clustering

These are situations where there is not a dependent variable to guide the process.

Instead we use clusters to build a knowledge structure. This may be used to precede the development of a supervised model.

Example uses:

• determine if meaningful relationships in the form of concepts can be found in the data
• determine a best set of input attributes for supervised learning
• detect outliers
• evaluate the likely performance of a supervised learner model

Outliers--instances where values of certain attributes appear to be extreme. When you plot these instances, they visibly stand apart.

Outliers are of interest either to exclude them (statisticians prefer to remove them as exceptions that unduly skew the statistical calculations) or to study them, since outliers may be instances of significant interest. Credit card fraud, for example, would show up as outlier-type data.

Another use is to find interesting relationships among retail products.

Typically the goal is to generate association rules: which products are "related"?

The results are used to decide how products are placed on shelves and on web pages, which items are adjacent, etc.

## Rule Generation (supervised)

Example credit card promotional database:

Rules generated for hypotheses are stated in terms of current rather than predicted behavior. Rules may be later used for classification or prediction, however.

• IF sex=Female && 19<=Age<=43 THEN LifeInsurancePromo=Yes (Accuracy=100%, coverage=66.7%)
• IF sex=Male && IncomeRange=40-50K THEN LifeInsurancePromo=No (Accuracy=100%, coverage=50%)
• IF CreditCardInsurance=Yes THEN LifeInsurancePromo=Yes (Accuracy=100%, coverage=33.3%)
• IF WatchPromo=Yes && IncomeRange=40-50K THEN LifeInsurancePromo=Yes (Accuracy=100%, coverage=33.3%)

## Association Rules

The purpose is to discover associations among the attributes, including relationships between output variables.

Inputs are generally discrete values: classifications or groupings (if numeric).
The boolean algebra makes evaluation simple. You could use linear regression, but we're interested in whether or not the instance is in an association.

• can predict any attribute and combinations of attributes
• are not intended to be used together as a set

Problem: immense number of possible associations

Output needs to be restricted to show only the most predictive associations, that is, only those with high support and high confidence

Examples:

1. IF Sex=Female && Age=over40 && CreditCardIns=No
THEN LifeInsurancePromo = Yes

2. IF Sex=Male && Age=over40 && CreditCardIns=No
THEN LifeInsurancePromo = Yes

3. IF Sex=Female && Age=over40
THEN CreditCardIns=No && LifeInsurancePromo = Yes

You may come up with rules with little meaning.

This is an apriori approach--that is, you need to set the groupings ahead of time.

### Support and confidence of a rule

Support: number of instances predicted correctly

Confidence: number of correct predictions, as a proportion of all instances that the rule applies to

Example: 4 cool days with normal humidity

• Support = 4, confidence = 100%

Normally, minimum support and confidence are pre-specified (e.g. 58 rules with support >= 2 and confidence >= 95% for the weather data)
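The two measures can be computed directly from their definitions. The sketch below assumes the rule behind the example is "IF temperature = cool THEN humidity = normal" and uses only the temperature and humidity columns of the standard 14-instance weather data, reproduced here from memory; treat both as assumptions rather than as part of the notes.

```python
# (temperature, humidity) for the 14 weather instances (assumed data).
weather = [
    ("hot", "high"), ("hot", "high"), ("hot", "high"), ("mild", "high"),
    ("cool", "normal"), ("cool", "normal"), ("cool", "normal"), ("mild", "high"),
    ("cool", "normal"), ("mild", "normal"), ("mild", "normal"), ("mild", "high"),
    ("hot", "normal"), ("mild", "high"),
]

def support_and_confidence(data, antecedent, consequent):
    """Support = instances predicted correctly; confidence = support / covered."""
    covered = [row for row in data if antecedent(row)]
    correct = [row for row in covered if consequent(row)]
    support = len(correct)
    confidence = support / len(covered) if covered else 0.0
    return support, confidence

support, confidence = support_and_confidence(
    weather,
    antecedent=lambda row: row[0] == "cool",
    consequent=lambda row: row[1] == "normal",
)
print(support, confidence)
```

Restricting the output to the most predictive associations then amounts to keeping only the rules whose two values meet the pre-specified minimums.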

## Neural Networks

This is an artificial intelligence technique that "learns" a model, which is then used for prediction.

Inputs are identified and must be numeric.

The neural network model is patterned after the interconnected neurons of the human brain. Nodes "fire" if the inputs meet certain thresholds. Determining these thresholds is what requires a learning phase.

Phase 1 (learning):

• each input is associated with an input layer node
• weights are associated in the hidden layer
• outputs are compared to the outputs in the training set, and changes to the weights are propagated back through the hidden layer
• Training continues iteratively until the outputs converge to a minimum error rate

Phase 2: use the network to compute new outputs for a new set of inputs.

Note that making sense of the weights generated within the network is hard due to the interconnectedness of the network.

Close to 0 = no, close to 1 = yes. What defines close? and what to do with values in the middle?
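A minimal sketch of Phase 2 follows, assuming a single hidden layer, sigmoid nodes, and no bias terms. The weights are made up (in practice they would come out of Phase 1, which is not shown), and the 0.3/0.7 cut-offs in `interpret` are just one hypothetical answer to "what defines close?".

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """Compute the network output for one instance (one output node)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

def interpret(output, low=0.3, high=0.7):
    # Hypothetical cut-offs: values in the middle are left undecided.
    if output >= high:
        return "yes"
    if output <= low:
        return "no"
    return "undecided"

# Made-up weights: 3 inputs -> 2 hidden nodes -> 1 output node.
hidden_weights = [[0.5, -1.0, 0.25], [-0.75, 0.5, 1.0]]
output_weights = [1.5, -2.0]

score = forward([1.0, 0.0, 1.0], hidden_weights, output_weights)
print(score, interpret(score))
```

Where to place the cut-offs, and whether an "undecided" band is acceptable, depends on the cost of a wrong yes versus a wrong no.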

## Statistical Regression

Linear regression model is commonly used to predict numeric outputs from a set of numeric inputs using a linear equation whose coefficients (weights) are determined by a training set of data.

One equation per output variable. Each equation is a sum of terms, each consisting of an attribute multiplied by a factor (which can be positive or negative). In more complex equations an attribute may be squared or cubed. The regression process determines the factors from the data set.

Example equation: lifeInsurancePromo = 0.591(CreditCardIns) - 0.546 (sex) +0.773
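Evaluating such an equation is a matter of plugging in the attribute values. The sketch below assumes both attributes are coded as 0 or 1 (1 = has credit card insurance, 1 = male); the notes do not state the coding, so this is only a guess for illustration.

```python
def life_insurance_promo(credit_card_ins, sex):
    """Example equation from the notes; inputs assumed to be coded 0 or 1."""
    return 0.591 * credit_card_ins - 0.546 * sex + 0.773

# Assumed coding: credit_card_ins 1 = yes, sex 1 = male.
print(life_insurance_promo(credit_card_ins=1, sex=0))
print(life_insurance_promo(credit_card_ins=0, sex=1))
```

Under that coding, a larger output would be read as a stronger indication of LifeInsurancePromo = Yes.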

Can use Excel or Minitab to generate these equations and analyze their appropriateness to the model.

We will look at linear regression in more detail later.

## Unsupervised Clustering

This is an exploratory data mining technique. You don't designate independent/input versus dependent/output variables. You are looking for relationships through clusterings of instances.

Apply some measure of similarity to divide instances into disjoint partitions.

What drives inclusion into a partition may be

• nearness to a group average
• hierarchical discovery

Follow up with other techniques to explore relationships among attributes that may not have been considered before.
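"Nearness to a group average" can be sketched with a K-means-style loop: assign each instance to the nearest average, recompute the averages, and repeat. The notes do not name this algorithm, so it is shown as one possible reading; the points, the starting centers, and the fixed iteration count are all hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def average(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def cluster(points, centers, iterations=10):
    """Partition points into disjoint groups around moving group averages."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        centers = [average(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Made-up two-attribute instances and starting centers.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centers, groups = cluster(points, centers=[(0, 0), (10, 10)])
print(groups)
```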

## Evaluating Performance

1. What are the benefits?
2. What is the return on investment (ROI)?
3. How do we interpret the results of the mining?
4. What confidence level do we have on the results?

Questions 1 and 2 are answered in the context of the business model of the organization for which you work.

### Evaluating supervised learner models

Classification correctness -- calculated by presenting previously unseen data to the model in the form of a test set (i.e., keep some of the data back from training and use it for the correctness testing).

Test set model accuracy can be summarized in a confusion matrix

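A Three-class Confusion Matrix (rows are actual classes, columns are computed classes)

| | A | B | C |
|---|---|---|---|
| A | A correctly classified | A classified as B | A classified as C |
| B | B classified as A | B correctly classified | B classified as C |
| C | C classified as A | C classified as B | C correctly classified |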

Examples with 10% error rate:

Confusion Matrix

| Model A | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 600 | 25 |
| Reject | 75 | 300 |

Confusion Matrix

| Model B | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 500 | 75 |
| Reject | 25 | 400 |
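The 10% figure can be checked from the matrix entries: everything off the diagonal is an error. A small sketch, with rows as actual classes and columns as computed classes (the `error_rate` helper is an illustrative name):

```python
def error_rate(matrix):
    """Fraction of test instances that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

# Rows: actual Accept, actual Reject. Columns: computed Accept, computed Reject.
model_a = [[600, 25], [75, 300]]
model_b = [[500, 75], [25, 400]]

print(error_rate(model_a), error_rate(model_b))
```

Both models make 100 errors in 1000 instances, but they make different kinds of errors: Model A wrongly accepts 75 instances, while Model B wrongly accepts only 25.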

Numeric output evaluation.

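Again evaluate with test data: compute the mean squared error, that is, the average of the squared differences (r²) between computed and actual outputs. Minimizing the squared error is at the heart of linear regression.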
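A minimal sketch of the mean squared error calculation is given below. The actual and computed values are hypothetical test-set outputs, chosen only to make the arithmetic easy to follow.

```python
def mean_squared_error(actual, computed):
    """Average of the squared differences between actual and computed outputs."""
    return sum((a - c) ** 2 for a, c in zip(actual, computed)) / len(actual)

# Hypothetical test-set outputs.
actual = [3.0, 5.0, 7.0]
computed = [2.0, 5.0, 9.0]

# Squared differences are 1, 0 and 4, so the mean is 5/3.
mse = mean_squared_error(actual, computed)
print(mse)
```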

## Lift

Measures the change in percent concentration of a desired class C from a set of classifications, taken from a biased sample relative to the concentration of C within the entire population. The more the bias, the better. Use a data mining model to determine the sample.

Lift = P(C | sample) / P(C | population)

A lift chart shows the performance of a data mining model as a function of sample size

Confusion matrix for a model

| Model Y | Computed Accept | Computed Reject |
|---|---|---|
| Accept | 450 | 550 |
| Reject | 19550 | 79450 |

Lift(Model Y) = (450/20000) / (1000/100000) = 2.25
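The same calculation can be written out from the four matrix entries. In this sketch the "sample" is taken to be every instance the model computes as Accept, which matches the numbers used above; the `lift` helper is an illustrative name.

```python
def lift(matrix):
    """Lift for the Accept class of a two-class confusion matrix."""
    (true_accept, false_reject), (false_accept, true_reject) = matrix
    population = true_accept + false_reject + false_accept + true_reject
    sample = true_accept + false_accept          # everything computed as Accept
    actual_accept = true_accept + false_reject   # everything that is Accept
    return (true_accept / sample) / (actual_accept / population)

# Rows: actual Accept, actual Reject. Columns: computed Accept, computed Reject.
model_y = [[450, 550], [19550, 79450]]

print(lift(model_y))
```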

## Unsupervised Model Evaluation

Evaluating an unsupervised model is more difficult.

Test sets can be used for evaluation, with the results then summarized in a confusion matrix.

Use 2/3 of the data for training and the remaining third as test data.
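A two-thirds/one-third split can be sketched as follows. Shuffling before the split and the fixed seed are assumptions made here so the example is repeatable; the notes only give the proportions.

```python
import random

def split_two_thirds(instances, seed=0):
    """Shuffle a copy of the data, then cut it 2/3 training, 1/3 test."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# 15 placeholder instances, identified only by their index.
train, test = split_two_thirds(range(15))
print(len(train), len(test))
```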