Data Exploration Review

last updated 23-Jun-2021

[There is no corresponding reading from the text, but this material should be review. Please ask for further explanation of any topics that are new to you.]

Introduction

Key motivations of data exploration, also related to the area of Exploratory Data Analysis (EDA)


Classic running example for data mining (Iris dataset)

Can be obtained from the UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html.
The data set was introduced by the statistician Ronald A. Fisher.

[Image of Iris Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute]

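Three flower types (classes, nominal, categorical): Setosa, Versicolour, and Virginica

Four (non-class, quantitative, continuous) attributes: sepal length, sepal width, petal length, and petal width, all measured in centimeters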


Summary Statistics

Summary statistics are numbers that summarize properties of the data. Most of them can be calculated in a single pass through the data.

Summarized properties include frequency, location and spread

Examples:

The frequency of an attribute value is the percentage of times the value occurs in the data set

The mode of an attribute is the most frequent attribute value

The notions of frequency and mode are typically used with categorical data

For continuous data, the notion of a percentile is more useful: the p-th percentile is a value below which about p% of the observed values fall.
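A minimal sketch of frequency, mode, and percentile using only the Python standard library. The sample values and the helper names `frequencies` and `percentile` are illustrative assumptions, and linear interpolation is only one of several common percentile conventions.

```python
from collections import Counter
import statistics

# Illustrative sample (hypothetical values, not the full Iris data set).
species = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]
petal_width = [0.2, 0.2, 1.4, 1.5, 1.9, 2.1, 2.5]

def frequencies(values):
    """Return {value: fraction of observations that have that value}."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}

def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) by linear interpolation."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of empty data")
    rank = (p / 100) * (len(xs) - 1)   # fractional position in the sorted list
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (rank - lo) * (xs[hi] - xs[lo])

freq = frequencies(species)            # fraction of flowers per species
mode = statistics.mode(species)        # most frequent species
median_width = percentile(petal_width, 50)
```

Frequency and mode are applied to the categorical `species` list, while the percentile is applied to the continuous `petal_width` list, matching the distinction drawn above.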

Location measures: mean and median

The mean is the most common measure of location, but it is sensitive to outliers.

For that reason, the median or a trimmed mean (the mean computed after removing the most extreme values) is often used instead.
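The effect of a single outlier on these three measures can be sketched as follows. The data values and the `trimmed_mean` helper are hypothetical; here the helper drops the same proportion of values from each end, which is one common convention.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    xs = sorted(values)
    k = int(len(xs) * proportion)      # number of values cut from each end
    return statistics.fmean(xs[k:len(xs) - k])

data = [1, 2, 3, 4, 5, 90]             # hypothetical values; 90 is an outlier

mean = statistics.fmean(data)          # pulled upward by the outlier
median = statistics.median(data)       # unaffected by how large the outlier is
trimmed = trimmed_mean(data, 0.2)      # drops 1 and 90 before averaging
```

With these values the mean is 17.5, while the median and the 20% trimmed mean are both 3.5.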

 

Spread measures: range and variance

The range is the difference between the maximum and the minimum: range(x) = max(x) - min(x)

The variance ($s_x^2$) or standard deviation ($s_x$) is the most common measure of the spread of a set of points.
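For reference, the usual sample definitions can be written as below. The symbols $m$ (the number of values) and $\bar{x}$ (the mean) are notation assumed here, not taken from these notes.

```latex
\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
s_x^2 = \frac{1}{m-1}\sum_{i=1}^{m} \left(x_i - \bar{x}\right)^2, \qquad
s_x = \sqrt{s_x^2}
```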

Because the variance is strongly influenced by outliers, other measures are often used:

AAD = Average absolute deviation
MAD = Median absolute deviation
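A short sketch comparing these spread measures on data with one outlier. The data and the helper names `value_range`, `aad`, and `mad` are illustrative; AAD is taken here as the mean absolute deviation from the mean and MAD as the median absolute deviation from the median.

```python
import statistics

def value_range(xs):
    """Difference between the largest and smallest value."""
    return max(xs) - min(xs)

def aad(xs):
    """Average absolute deviation from the mean."""
    center = statistics.fmean(xs)
    return statistics.fmean([abs(x - center) for x in xs])

def mad(xs):
    """Median absolute deviation from the median."""
    center = statistics.median(xs)
    return statistics.median([abs(x - center) for x in xs])

data = [1, 2, 3, 4, 5, 90]                 # hypothetical values; 90 is an outlier

spread = {
    "range": value_range(data),
    "variance": statistics.variance(data),  # sample variance (divides by m - 1)
    "stdev": statistics.stdev(data),
    "AAD": aad(data),
    "MAD": mad(data),
}
```

On this sample the range is 89 and the variance is 1263.5, both dominated by the single outlier, while the MAD is only 1.5.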


Visualization

Visualizations are much easier to generate on today's computers than they were in the earlier days of statistics (e.g., in Tukey's time).

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

Visualization of data is one of the most powerful and appealing techniques for data exploration.

[Image: Sea Surface Temperature for July 1982. Thousands of data points are shown in one image.]

Visual representation is the mapping of data to a visual format.

Data objects, their attributes, and the relationships among data objects are translated into the image elements such as points, lines, shapes, and colors.

Example:

Histograms

A histogram usually shows the distribution of values of a single variable: the values are divided into bins, and a bar shows the number of objects in each bin.

Example: Iris Petal Width (10 and 20 bins, respectively)
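The counts behind a histogram can be sketched without a plotting library. The `histogram` helper and the petal-width values below are hypothetical; equal-width bins over [min, max] are assumed, and smaller bin counts than 10 and 20 are used because the sample is tiny.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if width == 0:
            idx = 0                     # all values identical: use the first bin
        else:
            # The maximum value is placed in the last bin, not a new one.
            idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Illustrative petal widths in cm (hypothetical sample).
petal_width = [0.2, 0.2, 0.4, 1.3, 1.4, 1.5, 1.8, 2.1, 2.3, 2.5]

coarse = histogram(petal_width, 3)      # few wide bins
fine = histogram(petal_width, 6)        # more, narrower bins
```

Comparing `coarse` and `fine` shows why the number of bins matters: the same data can look smoother or more fragmented depending on the bin width.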

Two dimensional histograms show the joint distribution of the values of two attributes.

Example: petal width vs petal length
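A two-dimensional histogram simply bins each attribute separately and counts objects per pair of bins. The sketch below assumes equal-width bins, and the names `bin_index` and `histogram2d` and the sample values are hypothetical.

```python
from collections import Counter

def bin_index(value, lo, hi, bins):
    """Index of the equal-width bin over [lo, hi] that contains `value`."""
    if hi == lo:
        return 0
    width = (hi - lo) / bins
    return min(int((value - lo) / width), bins - 1)

def histogram2d(xs, ys, bins):
    """Return {(x_bin, y_bin): count} for paired observations."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    counts = Counter()
    for x, y in zip(xs, ys):
        cell = (bin_index(x, x_lo, x_hi, bins), bin_index(y, y_lo, y_hi, bins))
        counts[cell] += 1
    return counts

# Illustrative petal measurements in cm (hypothetical sample).
petal_width = [0.2, 0.2, 1.4, 1.5, 1.9, 2.5]
petal_length = [1.4, 1.3, 4.7, 4.5, 5.1, 6.0]

joint = histogram2d(petal_width, petal_length, bins=3)
```

Cells with high counts along the diagonal of `joint` would indicate that the two attributes tend to be small or large together.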

Box Plots

Sometimes called Tukey box plots after their inventor, John Tukey. A box plot summarizes the distribution of the data.

Good for comparisons, e.g. the attributes of the iris dataset.
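The numbers a box plot draws can be computed directly. This sketch assumes Tukey's convention of whiskers reaching the most extreme values within 1.5 times the interquartile range of the box; other variants (for example, whiskers at fixed percentiles) exist, and these notes do not say which one the figures use. The `box_plot_stats` helper and the data are hypothetical.

```python
import statistics

def box_plot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    xs = sorted(values)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],       # smallest value inside the fences
        "whisker_high": inside[-1],     # largest value inside the fences
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]     # hypothetical values; 30 is an outlier
stats = box_plot_stats(data)
```

Computing `box_plot_stats` once per attribute gives the side-by-side comparison that box plots are used for.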

Scatter plots

Attribute values determine the position of each point

Two-dimensional scatter plots are the most common, but three-dimensional scatter plots are also possible

Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects

Matrices of scatter plots are useful because they compactly summarize the relationships between several pairs of attributes

Contour plots

Useful when a continuous attribute is measured on a spatial grid

They partition the plane into regions of similar values

The contour lines that form the boundaries of these regions connect points with equal values

The most common example is contour maps of elevation

Can also display temperature, rainfall, air pressure, etc.
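The partitioning step can be sketched by assigning each grid cell to the band between two contour levels. This is only the banding part of a contour plot, not the tracing of the contour lines themselves, and the grid values, levels, and `contour_bands` helper are hypothetical.

```python
from bisect import bisect_right

def contour_bands(grid, levels):
    """Label each grid cell with the index of the band its value falls in.

    Band 0 is below the first level, band 1 lies between the first and
    second levels, and so on.
    """
    ordered = sorted(levels)
    return [[bisect_right(ordered, value) for value in row] for row in grid]

# Hypothetical temperatures measured on a 3 x 3 spatial grid.
temperature = [
    [10, 12, 18],
    [14, 21, 26],
    [19, 27, 33],
]

bands = contour_bands(temperature, levels=[15, 25])
```

Neighboring cells with the same band index form one region of similar values, and a contour line would run along the boundary where the band index changes.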

 

Matrix plots

Can plot the data matrix

This can be useful when objects are sorted according to class

Typically, the attributes are normalized to prevent one attribute from dominating the plot

Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
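The preparation steps for a matrix plot can be sketched as: sort the objects by class, normalize each attribute, and optionally compute a distance matrix. Z-score standardization is assumed here as the normalization, which is one common choice; the helper names and sample records are hypothetical.

```python
import math
import statistics

def standardize_columns(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 to avoid dividing by 0.
    spreads = [statistics.stdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, spreads)] for row in rows]

def distance_matrix(rows):
    """Euclidean distance between every pair of rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Hypothetical records: ([sepal length, petal width], class label).
records = [
    ([6.3, 2.5], "virginica"),
    ([5.1, 0.2], "setosa"),
    ([7.0, 1.4], "versicolor"),
    ([4.9, 0.2], "setosa"),
]

by_class = sorted(records, key=lambda record: record[1])   # group rows by class
normalized = standardize_columns([features for features, _ in by_class])
distances = distance_matrix(normalized)
```

Plotting `normalized` or `distances` as a grid of colored cells would then show class structure as blocks, since rows of the same class sit next to each other.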

 

Parallel Coordinates

Used to plot the attribute values of high-dimensional data

Instead of using perpendicular axes, use a set of parallel axes

The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line

Thus, each object is represented as a line

Often, the lines representing a distinct class of objects group together, at least for some attributes

Ordering of attributes is important in seeing such groupings (difference between the two plots below)
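The construction can be sketched as turning each object into a polyline of (axis position, scaled value) points. Min-max scaling to [0, 1] is assumed so that all axes share one height; the helpers `min_max_scale` and `polylines` and the sample rows are hypothetical.

```python
def min_max_scale(rows):
    """Scale each attribute (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.5
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def polylines(rows, order):
    """One polyline per object; `order` lists attribute indices left to right."""
    scaled = min_max_scale(rows)
    return [[(axis, row[attr]) for axis, attr in enumerate(order)] for row in scaled]

# Hypothetical rows: sepal length, sepal width, petal length, petal width (cm).
flowers = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]

default_order = polylines(flowers, order=[0, 1, 2, 3])
reordered = polylines(flowers, order=[2, 3, 0, 1])   # petal attributes first
```

Changing `order` only changes which axis each value lands on, which is why different attribute orderings can make class groupings easier or harder to see.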

Star Plots

Similar approach to parallel coordinates, but axes radiate from a central point

The line connecting the values of an object is a polygon

Each star plot represents a single object (one example/observation).
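The polygon for one object can be sketched by placing each attribute on its own axis at an evenly spaced angle. The values are assumed to be already scaled to [0, 1], and the `star_vertices` helper and the sample object are hypothetical.

```python
import math

def star_vertices(scaled_values):
    """Polygon vertices for one object; attribute k lies on the axis at angle 2*pi*k/n."""
    n = len(scaled_values)
    vertices = []
    for k, radius in enumerate(scaled_values):
        angle = 2 * math.pi * k / n
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices

# One hypothetical object with four attributes, already scaled to [0, 1].
polygon = star_vertices([1.0, 0.5, 1.0, 0.5])
```

Connecting the vertices in order, and the last back to the first, gives the star shape for that object.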


On-Line Analytical Processing (OLAP)

Proposed by E. F. Codd, the father of the relational database.

Relational databases put data into tables (many tables, which are "normalized" to eliminate duplicate data for updating purposes).

OLAP uses a multidimensional matrix/array representation. Usually the data is static; the matrix is only added to.

Converting tabular data into a multidimensional array:

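For the Iris data:

First, we discretize the petal width and petal length to have categorical values: low, medium, and high. Then we count the number of flowers for each combination of petal width, petal length, and species.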

An Excel Pivot Table will take the columnar data and treat it as a data cube.
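The conversion can be sketched as discretizing the two petal attributes and counting flowers per cell. The cut points, the sample rows, and the `discretize` helper are assumptions made for illustration; these notes do not state the thresholds used.

```python
from collections import Counter

def discretize(value, low_cut, high_cut):
    """Map a number to 'low', 'medium', or 'high' using two cut points."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"

# Hypothetical rows: (petal length in cm, petal width in cm, species).
rows = [
    (1.4, 0.2, "setosa"),
    (1.3, 0.2, "setosa"),
    (4.7, 1.4, "versicolor"),
    (4.5, 1.5, "versicolor"),
    (4.9, 1.8, "virginica"),
    (5.1, 1.9, "virginica"),
    (6.0, 2.5, "virginica"),
]

# Each cell of the cube is keyed by (petal width, petal length, species)
# and holds a count. The cut points below are assumed, not from the notes.
cube = Counter(
    (discretize(width, 0.75, 1.75), discretize(length, 2.5, 5.0), species)
    for length, width, species in rows
)
```

Only non-empty cells are stored, so looking up a combination that never occurs, such as `cube[("low", "high", "setosa")]`, returns 0.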

Data cube operations

Slice - select a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions

Dice - select a subset of cells by specifying a range of attribute values

Roll up - combine (aggregate) the data within a group of values, reducing the cube to fewer dimensions or to a coarser level of detail (attributes may be hierarchical in structure)

Drill down - expand the cube into more dimensions or a finer level of detail, which is possible because an attribute has internal structure (for example, years have months)
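These operations can be sketched on a small cube stored as a dictionary of counts. The cube contents, the dimension names in `DIMS`, and the helper functions are hypothetical and not part of any OLAP product's API.

```python
from collections import Counter

DIMS = ("petal_width", "petal_length", "species")

# Hypothetical cube: counts keyed by (petal width, petal length, species).
cube = Counter({
    ("low", "low", "setosa"): 2,
    ("medium", "medium", "versicolor"): 2,
    ("high", "medium", "virginica"): 1,
    ("high", "high", "virginica"): 2,
})

def slice_cube(cube, dim, value):
    """Slice: keep the cells where one dimension has a specific value."""
    i = DIMS.index(dim)
    return {key: n for key, n in cube.items() if key[i] == value}

def dice_cube(cube, allowed):
    """Dice: keep the cells whose values fall in the allowed set per dimension."""
    return {
        key: n for key, n in cube.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }

def roll_up(cube, dim):
    """Roll up: sum the counts over one dimension, removing it from the keys."""
    i = DIMS.index(dim)
    totals = Counter()
    for key, n in cube.items():
        totals[key[:i] + key[i + 1:]] += n
    return totals

virginica_only = slice_cube(cube, "species", "virginica")
wide_medium = dice_cube(cube, {"petal_width": {"medium", "high"},
                               "petal_length": {"medium"}})
by_width_and_species = roll_up(cube, "petal_length")
```

Drill down is the reverse of `roll_up`: going from `by_width_and_species` back to the more detailed `cube`, which requires that the detailed counts were kept.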