Data Quality and Preprocessing
Preprocessing
We consider some aspects of preprocessing here. Other preprocessing aspects will be covered later in other chapters.
When you have a data set, the raw data should be reviewed for problems.
For integrity and data mining, we must not alter data values to help make our case or a visualization more pleasing.
Truth needs to remain in the data.
On the other hand, the quality, or lack thereof, of the data set has to be considered.
Data Quality
Poor data quality negatively affects many data processing efforts.
“The most important point is that poor data quality is an unfolding disaster.
Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.”,
Thomas C. Redman, DM Review, August 2004
Data mining example: a classification model for detecting people who are loan risks is built using poor data. The results may be that
 some creditworthy candidates are denied loans
 more loans are given to individuals that default
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
 Noise and outliers
 Missing values
 Duplicate data
 Wrong data
The first two are considered in more detail below.
Noisy data
For objects, noise is considered an extraneous object.
For attributes, noise refers to modification of original values.
 Examples: distortion of a person's voice when talking on a poor phone connection and "snow" on a television screen
 We can talk about signal to noise ratio.
Left image: two clean sine waves, with high (essentially infinite) SNR; right image: the two waves combined with noise, with low SNR.
Origins of noise
 outliers: values seemingly outside the normal range of the data
 duplicate records: good database design should minimize these (use DISTINCT in SQL retrievals)
 incorrect attribute values: again, good database design and integrity constraints should minimize these
 for numeric attributes, rogue strings or characters where numbers should be
 null handling for attributes (nulls = missing values)
Outliers: be careful!
Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
Case 1: Outliers are noise that interferes with data analysis.
Case 2: Recognizing outliers is the goal of the analysis.
 Credit card fraud
 Intrusion detection
Causes for case 1?
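One simple way to detect case-1 outliers is to flag values that lie far from the mean, measured in standard deviations. A minimal sketch (the helper name `flag_outliers`, the threshold, and the data are illustrative, not from the notes):

```python
import statistics

def flag_outliers(values, z_thresh=2.0):
    """Return values more than z_thresh standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > z_thresh]

readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(flag_outliers(readings))  # [95]
```

Note that in small samples the outlier itself inflates the mean and standard deviation, so a strict threshold like 3 can miss it; robust alternatives (median, interquartile range) are often preferred.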
Missing Data Handling
Many causes: malfunctioning equipment, changes in experimental design, collation of different data sources, measurement not possible. People may not wish to supply information. Information may not be applicable (children don't have an annual income).
 Discard records with missing values
 Ordinal/continuous data: could replace with the attribute mean
 Substitute with a value from a similar instance
 Ignore missing values, i.e., just proceed and let the tools deal with them
 Treat missing values as equals (all share the same missing value code)
 Treat missing values as unequal values
BUT... missing (null) values may have significance in themselves (e.g., a missing test in a medical examination; a missing death date means still alive!)
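The "replace with attribute mean" strategy above can be sketched as follows (`impute_mean` is a hypothetical helper, with `None` standing in for the missing-value code):

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    return [mu if v is None else v for v in values]

ages = [23, None, 31, 27, None]
print(impute_mean(ages))  # [23, 27, 31, 27, 27]
```

Mean imputation is only safe when values are missing at random; when missingness itself carries meaning (the death-date example), imputing destroys information.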
Missing completely at random (MCAR)
 Missingness of a value is independent of attributes
 Fill in values based on the attribute as suggested above (e.g. attribute mean)
 Analysis may be unbiased overall
Missing at Random (MAR)
 Missingness is related to other variables
 Fill in values based on other values (e.g., from similar instances)
 Almost always produces a bias in the analysis
Missing Not at Random (MNAR)
 Missingness is related to unobserved measurements
 Informative or nonignorable missingness
It is not possible to tell which situation holds from the data alone. You need to know the context, the application field, the data collection process, etc.
Inaccurate values
Issues and consideration
 Data may not have been collected for mining purposes
 Errors and omissions may not affect the original purpose of the data (e.g., the age of a customer)
 Typographical errors in nominal attributes => values need to be checked for consistency
 Typographical and measurement errors in numeric attributes => outliers need to be identified
 Errors may be deliberate (e.g. wrong postal codes, birthdates)
 Other problems: duplicates, stale data
Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one another
A major issue when merging data from multiple, heterogeneous sources
 Examples:
Same person with multiple email addresses
When should duplicate data not be removed?
We will address this further in the later sections on similarity and dissimilarity in the chapter.
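Exact duplicates can be dropped by keeping the first record seen per key; a minimal sketch (names and data are made up). Near-duplicates, such as the same person under different email addresses, need the similarity measures just mentioned:

```python
def dedupe(records, key):
    """Keep only the first record seen for each key value."""
    seen = set()
    unique = []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique

people = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Ann", "email": "ann@work.example.com"},  # near-duplicate: not caught
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Ann", "email": "ann@example.com"},       # exact duplicate: removed
]
print(len(dedupe(people, key=lambda r: (r["name"], r["email"]))))  # 3
```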
Data Preprocessing
Aggregation: combining two or more attributes (or objects) into a single attribute (or object)
Sampling: the main technique employed for data set reduction (reduces the number of rows)
Dimensionality Reduction: identify the "important" variables
Feature subset selection: remove redundant or irrelevant attributes
Feature creation: create new attributes that can capture the important information in a data set much more efficiently than the original attributes
Discretization and Binarization
Attribute Transformation: a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
Aggregation
Purpose
 Data reduction
 Reduce the number of attributes or objects
 Results in simpler models
 Faster computation of the models
 Change of scale
 Cities aggregated into regions, states, countries, etc.
 Days aggregated into weeks, months, or years
 More “stable” data
 Aggregated data tends to have less variability
Example: Australia precipitation standard deviation
The left histogram shows the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia. The right histogram shows the standard deviation of the average yearly precipitation for the same locations.
The average yearly precipitation has less variability than the average monthly precipitation
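The smoothing effect can be checked directly: averaging monthly values into yearly values shrinks the standard deviation. All numbers below are invented for illustration:

```python
import statistics

# three years of made-up monthly precipitation values
monthly = [50, 80, 120, 30, 60, 90, 110, 40, 70, 100, 20, 95,
           55, 85, 115, 35, 65, 95, 105, 45, 75, 105, 25, 90,
           60, 75, 125, 25, 55, 85, 115, 50, 80, 95, 30, 100]

# aggregate: average monthly precipitation per year
yearly = [sum(monthly[i:i + 12]) / 12 for i in range(0, 36, 12)]

print(statistics.stdev(monthly))  # large: month-to-month variability
print(statistics.stdev(yearly))   # much smaller: aggregation smooths it out
```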
Sampling
Processing the entire dataset may be too expensive or time consuming, or not possible due to memory size
Using a sample will work almost as well as using the entire data set, if the sample is representative.
A sample is representative if it has approximately the same properties (of interest) as the original set of data
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular item
 Sampling without replacement:
As each item is selected, it is removed from the population
 Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once
Stratified sampling
 Split the data into several partitions; then draw random samples from each partition
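The three sampling schemes can be sketched with the standard library (the population and the parity-based strata are toy choices):

```python
import random

random.seed(42)  # reproducible for illustration
population = list(range(100))

# simple random sampling without replacement: each item selected at most once
without_rep = random.sample(population, 10)

# simple random sampling with replacement: duplicates are possible
with_rep = [random.choice(population) for _ in range(10)]

# stratified sampling: partition the data, then sample each partition
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified = {name: random.sample(group, 5) for name, group in strata.items()}
print(stratified)
```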
Dimension Reduction
Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse in the space that it occupies
 Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Purpose:
 Avoid curse of dimensionality
 Reduce amount of time and memory required by data mining algorithms
 Allow data to be more easily visualized
 May help to eliminate irrelevant features or reduce noise
 Identify highly correlated features: redundant correlated features bring little new information to the model
Techniques
 Principal Components Analysis (PCA): find a projection that captures the largest amount of variation in the data
 Singular Value Decomposition (SVD)
 Others: supervised and nonlinear techniques
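A bare-bones PCA projection via the eigendecomposition of the covariance matrix, as a sketch (assumes NumPy is available; `pca_project` is a hypothetical helper, and real work would use a library implementation such as scikit-learn's):

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh gives ascending eigenvalues
    top_k = eigvecs[:, ::-1][:, :k]         # top-k principal directions
    return Xc @ top_k

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))    # 50 objects, 5 attributes
print(pca_project(X, 2).shape)  # (50, 2): dimensionality reduced to 2
```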
Feature Subset Selection
Another way to reduce dimensionality of data. You bring common sense to the analysis and preprocessing.
Redundant features
 Duplicate much or all of the information contained in one or more other attributes
 Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
 Contain no information that is useful for the data mining task at hand
 Example: students' ID is often irrelevant to the task of predicting students' GPA
Feature Creation
Three general methodologies:
 Feature extraction
Example: extracting edges from images
 Feature construction
Example: dividing mass by volume to get density
 Mapping data to new space
Example: Fourier and wavelet analysis
Data Normalization
Recalculating the values for better comparison
Ensure consistent units (monetary, measurements, temperature):
 Metric, British, and American weights and lengths
 currency: use a common unit (Euro, USD)
 currency adjusted for inflation: the value of money is not the same as it was 10 years ago
Other scalings
Change numeric values to fall within a specified range, such as scaling values to fall between 0 and 1, or -1 and 1.
This allows better comparisons or visualizations of attributes that are of different units.
Decimal scaling
 divide by a constant that brings all values into the acceptable range
 e.g., if we have a range of values -40 to 120, then dividing by 120 gives values between -1 and 1;
or subtract 40 and then divide by 80 to map -40 to -1 and 120 to 1.
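The first variant, dividing by the largest absolute value, can be sketched as (`scale_by_max` is an illustrative name):

```python
def scale_by_max(values):
    """Divide every value by the largest absolute value, landing in [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

print(scale_by_max([-40, 0, 60, 120]))  # [-0.33..., 0.0, 0.5, 1.0]
```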
Min-max normalization
 when you know the limits (minimum and maximum) of the original values, you can transform or scale them to another range [newMin, newMax]
 general formula is
newValue = (originalValue - oldMin) / (oldMax - oldMin) * (newMax - newMin) + newMin
Often the desired scale range is [0, 1], so the formula becomes
newValue = (originalValue - oldMin) / (oldMax - oldMin)
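The general formula as code (`min_max` is an illustrative name; it reads oldMin and oldMax from the data itself):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values from [min(values), max(values)] into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

print(min_max([-40, 40, 120]))         # [0.0, 0.5, 1.0]
print(min_max([-40, 40, 120], -1, 1))  # [-1.0, 0.0, 1.0]
```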
Z-score normalization
 used when the minimum and maximum are not known (i.e., you expect more data in the future but want consistency)
 uses the mean of the values, μ, and the standard deviation, σ
 all values won't be between -1 and 1, but most will be
 the average of the scaled values will be near 0
newValue = (originalValue - μ) / σ
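The same formula as code, using the population standard deviation (`z_score` is an illustrative name):

```python
import statistics

def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0] -- the mean of the result is 0
```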
Logarithmic normalization
When you have widely varying magnitudes, you may want changes in small numbers to not be lost in the mix of some large numbers in the attribute.
Apply logarithms of base b (b=2, e, or 10) to the values
 requires the original values to be positive, because the logarithm of a negative number is undefined. Also note that original values between 0 and 1 become negative after normalization.
Original     Log base 10
1            0.00
5            0.70
2            0.30
15           1.18
30           1.48
4            0.60
150          2.18
48           1.68
360          2.56
1700         3.23
15000        4.18
3            0.48
50           1.70
60000        4.78
43211456     7.64
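The table values can be reproduced directly, rounding to two decimals as in the table; the guard for non-positive values reflects the restriction above (`log_normalize` is an illustrative name):

```python
import math

def log_normalize(values, base=10):
    """Apply log_base to each value; large magnitudes are compressed."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithmic normalization requires positive values")
    return [round(math.log(v, base), 2) for v in values]

print(log_normalize([1, 5, 150, 15000, 43211456]))  # [0.0, 0.7, 2.18, 4.18, 7.64]
```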
Data Type Conversion
Categorical/nominal data to numeric equivalent (coding)
 be sure that when you do this, there is meaning to values assigned
 do the numeric orders make sense? Are the nominal data ordered?
 test whether an average of these numbers makes sense
 e.g., Red=0, Green=1, Blue=2, Yellow=3, Black=4
 e.g., Likert scales: No = -1, Don't Care = 0, Yes = 1
Numeric to nominal (discretization)
Organize data into "bins" or ranges.
 A potentially infinite number of values are mapped into a small number of categories
 Discretization is commonly used in classification
 Many classification algorithms work best if both the independent and dependent variables have only a few values
Three approaches
Numeric to nominal conversion
Original data: 
2 3 4 5 6 7 9 10 11 15 16 20

1. Bins have even ranges (equal width): max - min = range, 20 - 2 = 18; range / nBins = 18 / 3 = 6.
   Bin 1: [2, 8) = {2, 3, 4, 5, 6, 7}; Bin 2: [8, 14) = {9, 10, 11}; Bin 3: [14, 20] = {15, 16, 20}
2. Bins have the same number of values (equal frequency): n / nBins = 12 / 3 = 4 values per bin.
   Bin 1: {2, 3, 4, 5}; Bin 2: {6, 7, 9, 10}; Bin 3: {11, 15, 16, 20}
3. Find natural gaps in the data (some variation possible).
   Bin 1: {2, 3, 4, 5, 6, 7}; Bin 2: {9, 10, 11}; Bin 3: {15, 16, 20}
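Approaches 1 and 2 can be sketched directly (hypothetical helper names; approach 3, finding natural gaps, usually needs clustering or manual inspection):

```python
data = [2, 3, 4, 5, 6, 7, 9, 10, 11, 15, 16, 20]

def equal_width_bins(values, n_bins):
    """Approach 1: split the value range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # max value goes in last bin
        bins[i].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Approach 2: give each bin the same number of (sorted) values."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

print(equal_width_bins(data, 3))      # [[2, ..., 7], [9, 10, 11], [15, 16, 20]]
print(equal_frequency_bins(data, 3))  # [[2, ..., 5], [6, 7, 9, 10], [11, 15, 16, 20]]
```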
Numeric data that is continuous (real) may be processed by many tools as binary (low/high or yes/no).
 A mean or median is found and then the relation of < or > is used to split the values into two groups.
 This is where you may want to transform into a larger number of categories and/or normalize.
Iris example of discretizing the measurements
How can we tell what the best discretization is?
Unsupervised discretization: find breaks in the data values
Supervised discretization: Use class labels to find breaks
Binarization
Map a continuous or categorical attribute into one or more binary variables
Often a continuous attribute is first converted to a categorical attribute, and the categorical attribute is then converted to a set of binary attributes
Association analysis needs asymmetric binary attributes
Examples: eye color and height measured as
{low, medium, high}
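Mapping the {low, medium, high} example to binary attributes, one indicator per category (`one_hot` is an illustrative name):

```python
def one_hot(values):
    """Map a categorical attribute to one binary attribute per category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

heights = ["low", "high", "medium", "low"]
for row in one_hot(heights):
    print(row)
# first row printed: {'high': 0, 'low': 1, 'medium': 0}
```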