Data Quality and Preprocessing

DM 352 Syllabus | DM 552 Syllabus

last updated 23-Jun-2021


We consider some aspects of preprocessing here. Other preprocessing aspects will be covered later in other chapters.

When you have a data set, the raw data should be reviewed for problems.

For integrity and data mining, we must not alter data values to help make our case or a visualization more pleasing.

Truth needs to remain in the data.

On the other hand, the quality, or lack thereof, of the data set has to be considered.

Data Quality

Poor data quality negatively affects many data processing efforts.

“The most important point is that poor data quality is an unfolding disaster. Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.”, Thomas C. Redman, DM Review, August 2004

Data mining example: a classification model for detecting people who are loan risks is built using poor data. The result may be that credit-worthy applicants are denied loans while applicants likely to default are approved.

What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?

Examples of data quality problems: noise and outliers, missing values, inaccurate values, and duplicate data.

The first two are considered in more detail below.

Noisy data

For objects, noise refers to extraneous objects mixed in with the objects of interest.

For attributes, noise refers to modification of original values.


Origins of noise

Outliers--be careful!

Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set.

Case 1: Outliers are noise that interferes with data analysis

Case 2: Recognizing outliers can be the goal of our analysis

Causes for case 1?


Missing Data Handling

Many causes: malfunctioning equipment, changes in experimental design, collation of different data sources, measurement not being possible. People may decline to supply information, or the information may not be applicable (children don't have annual income).

BUT... missing (null) values may have significance in themselves (e.g., a missing test in a medical examination; a missing death date means still alive!)

Missing completely at random (MCAR): the probability that a value is missing does not depend on any attribute values, observed or missing.

Missing at Random (MAR): the probability that a value is missing depends only on other, observed attribute values.

Missing Not at Random (MNAR): the probability that a value is missing depends on the missing value itself (e.g., high earners declining to report income).

Not possible to know the situation from the data. You need to know the context, application field, data collection process, etc.
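Two of the simplest remedies, deletion and mean imputation, can be sketched with the standard library. The records and the `income` attribute below are made-up illustrations, not data from the text.

```python
import statistics

# Hypothetical records; None marks a missing annual income.
records = [
    {"name": "A", "income": 52000},
    {"name": "B", "income": None},    # missing: child, refusal, or error?
    {"name": "C", "income": 61000},
    {"name": "D", "income": 48000},
]

# Remedy 1: listwise deletion -- drop objects that have a missing value.
complete = [r for r in records if r["income"] is not None]

# Remedy 2: mean imputation -- replace missing values with the attribute mean.
mean_income = statistics.mean(r["income"] for r in complete)
imputed = [dict(r, income=r["income"] if r["income"] is not None else mean_income)
           for r in records]
```

Which remedy is appropriate depends on why the value is missing; mean imputation, for instance, can badly distort an MNAR attribute.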

Inaccurate values

Issues and considerations

Duplicate Data

Data set may include data objects that are duplicates, or almost duplicates of one another

A major issue when merging data from multiple, heterogeneous sources

When should duplicate data not be removed?
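A minimal sketch of exact-duplicate removal that keeps the first occurrence of each object; the customer rows are invented for illustration. Real merged sources usually also contain near-duplicates (typos, differing formats), which need the similarity measures discussed later.

```python
# Hypothetical customer rows merged from two sources; exact duplicates appear.
rows = [
    ("Ann Lee", "ann@example.com"),
    ("Bob Ray", "bob@example.com"),
    ("Ann Lee", "ann@example.com"),   # exact duplicate from the second source
]

seen = set()
deduped = []
for row in rows:
    if row not in seen:               # keep only the first occurrence
        seen.add(row)
        deduped.append(row)
```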

We will address this further in the later sections on similarity and dissimilarity in the chapter.


Data Preprocessing

Aggregation - combining two or more attributes (or objects) into a single attribute (or object)

Sampling - the main technique employed for data set reduction (reduce number of rows)

Dimensionality Reduction - identify "important" variables

Feature subset selection - remove redundant or irrelevant attributes

Feature creation- new attributes that can capture the important information in a data set much more efficiently than the original attributes

Discretization and Binarization

Attribute Transformation - a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values



Example: Australia precipitation standard deviation

The left histogram shows the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia. The right histogram shows the standard deviation of the average yearly precipitation for the same locations.

The average yearly precipitation has less variability than the average monthly precipitation
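The smoothing effect of aggregation can be demonstrated with synthetic data; the uniform monthly values below are a stand-in for one hypothetical location, not the real Australian measurements.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

# Synthetic stand-in: 40 years of monthly precipitation values for one site.
monthly = [random.uniform(0, 100) for _ in range(40 * 12)]

# Aggregate: average the 12 monthly values of each year into one yearly value.
yearly = [statistics.mean(monthly[y * 12:(y + 1) * 12]) for y in range(40)]

# Averaging smooths month-to-month variation, so the aggregated attribute
# has a noticeably smaller standard deviation than the raw monthly one.
sd_monthly = statistics.pstdev(monthly)
sd_yearly = statistics.pstdev(yearly)
```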


Processing the entire dataset may be too expensive or time consuming, or not possible due to memory size

Using a sample will work almost as well as using the entire data set, if the sample is representative.

A sample is representative if it has approximately the same properties (of interest) as the original set of data

Types of Sampling

Simple Random Sampling

There is an equal probability of selecting any particular item

Stratified sampling
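Both sampling schemes can be sketched with the standard library. The 90/10 class split below is made up to show why stratification matters: a simple random sample may miss the rare class entirely, while proportional stratified sampling guarantees its share.

```python
import random
from collections import Counter

random.seed(1)  # reproducible illustration

# Hypothetical labeled data set: 90 objects of class "a", 10 of class "b".
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

# Simple random sampling: every object is equally likely to be chosen,
# so the rare class can be under- or over-represented by chance.
srs = random.sample(data, 10)

# Stratified sampling: sample each class (stratum) separately,
# in proportion to its size in the full data set.
strata = {}
for label, obj in data:
    strata.setdefault(label, []).append((label, obj))

stratified = []
for label, members in strata.items():
    k = round(10 * len(members) / len(data))   # proportional allocation
    stratified += random.sample(members, k)
```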

Dimension Reduction

Curse of dimensionality



Feature Subset Selection

Another way to reduce dimensionality of data. You bring common sense to the analysis and preprocessing.

Redundant features

Irrelevant features

Feature Creation

Three general methodologies:

  1. Feature extraction
    Example: extracting edges from images
  2. Feature construction
    Example: dividing mass by volume to get density
  3. Mapping data to new space
    Example: Fourier and wavelet analysis



Data Normalization

Recalculating the values for better comparison

Ensure consistent units (monetary, measurements, temperature):

Other scalings

Change numeric values to fall within a specified range, such as scaling values to fall between 0 and 1, or -1 and 1.

This allows better comparisons or visualizations of attributes that are of different units.

Decimal scaling

Min-Max normalization

newValue = ((originalValue - oldMin) / (oldMax - oldMin)) * (newMax - newMin) + newMin

Often the desired scale range is [0,1], so the formula becomes

newValue = (originalValue - oldMin) / (oldMax - oldMin)

Z-score normalization

newValue = (originalValue - μ) / σ

where μ is the mean and σ the standard deviation of the attribute's values.
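Both normalizations are one-liners in Python; the attribute values below are invented for illustration. After min-max scaling the values lie in [0,1]; after z-score scaling they have mean 0 and standard deviation 1.

```python
import statistics

values = [200, 300, 400, 600, 1000]    # hypothetical attribute values

# Min-max normalization to [0, 1]: (v - min) / (max - min)
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: (v - mean) / standard deviation
mu = statistics.mean(values)            # 500
sigma = statistics.pstdev(values)       # population standard deviation
zscores = [(v - mu) / sigma for v in values]
```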

Logarithmic normalization

When the values vary widely in magnitude, you may want differences among the small values not to be lost next to the few large values in the attribute.

Apply logarithms of base b (b=2, e, or 10) to the values

Original    Log base 10
       1       0.00
       5       0.70
       2       0.30
      15       1.18
      30       1.48
       4       0.60
     150       2.18
      48       1.68
     360       2.56
    1700       3.23
   15000       4.18
       3       0.48
      50       1.70
   60000       4.78
43211456       7.64
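The table can be reproduced directly: base-10 logarithms compress a range spanning eight orders of magnitude into single-digit values.

```python
import math

# The original attribute values from the table above.
values = [1, 5, 2, 15, 30, 4, 150, 48, 360, 1700, 15000, 3, 50, 60000, 43211456]

# Base-10 logarithm of each value, rounded to two decimal places.
logged = [round(math.log10(v), 2) for v in values]
```

Note that all values must be positive; shifted variants such as log(v + 1) are commonly used when zeros are present.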

Data Type Conversion

Categorical/nominal data to numeric equivalent (coding)


Numeric to nominal (discretization)

Organize data into "bins" or ranges.

Three approaches

Numeric to nominal conversion

Original data: 2 3 4 5 6 7 9 10 11 15 16 20

Process                             Notes                              Bin 1                  Bin 2               Bin 3
1. bins have even ranges            width = (max - min) / nBins = 6    [2,8) = {2,3,4,5,6,7}  [8,14) = {9,10,11}  [14,20] = {15,16,20}
2. bins have same number of values  n / nBins = 4 values per bin       {2,3,4,5}              {6,7,9,10}          {11,15,16,20}
3. find natural gaps in the data    some variation possible            {2,3,4,5,6,7}          {9,10,11}           {15,16,20}
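The first two binning schemes from the table can be sketched in a few lines (gap finding is more ad hoc and is omitted here):

```python
data = [2, 3, 4, 5, 6, 7, 9, 10, 11, 15, 16, 20]
n_bins = 3

# Equal-width bins: width = (max - min) / nBins = (20 - 2) / 3 = 6
lo, hi = min(data), max(data)
width = (hi - lo) / n_bins
equal_width = [[] for _ in range(n_bins)]
for v in data:
    i = min(int((v - lo) // width), n_bins - 1)  # clamp the max into the last bin
    equal_width[i].append(v)

# Equal-frequency bins: n / nBins = 12 / 3 = 4 values per bin (data is sorted)
size = len(data) // n_bins
equal_freq = [data[i:i + size] for i in range(0, len(data), size)]
```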

Numeric data that is continuous (real) may be processed by many tools as binary (low/high or yes/no).

Iris example of discretizing the measurements

How can we tell what the best discretization is?

Unsupervised discretization: find breaks in the data values

Supervised discretization: Use class labels to find breaks


Map a continuous or categorical attribute into one or more binary variables

Often convert a continuous attribute to a categorical attribute and then convert a categorical attribute to a set of binary attributes

Association analysis needs asymmetric binary attributes

Examples: eye color and height measured as {low, medium, high}
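Mapping the discretized height attribute to binary variables can be done with a one-hot scheme; the sample values below are made up. Each category becomes one asymmetric binary attribute that is 1 only when that category is present.

```python
# Hypothetical categorical attribute: height discretized to {low, medium, high}.
heights = ["low", "high", "medium", "low"]
categories = ["low", "medium", "high"]

# One binary attribute per category: 1 if the object has that value, else 0.
one_hot = [[1 if h == c else 0 for c in categories] for h in heights]
# "low" -> [1, 0, 0], "medium" -> [0, 1, 0], "high" -> [0, 0, 1]
```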