last updated 23-Jun-2021
[There is no corresponding reading from the text, but this material should be review. Please ask for further explanation of any topics that are new to you.]
Key motivations of data exploration, also related to the area of Exploratory Data Analysis (EDA)
The Iris data set can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
It originates with the statistician Ronald A. Fisher.
[Image of Iris Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute]
Three flower types (classes; nominal, categorical): Setosa, Versicolour, and Virginica
Four non-class, quantitative (continuous) attributes: sepal length, sepal width, petal length, and petal width
Summary statistics are numbers that summarize properties of the data and can be calculated in a single pass through the data.
Summarized properties include frequency, location and spread
Examples for quantitative data:
The frequency of an attribute value is the percentage of time the value occurs in the data set
The mode of an attribute is the most frequent attribute value
The notions of frequency and mode are typically used with categorical data
For continuous data, the notion of a percentile is more useful.
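A minimal sketch of frequency, mode, and percentile using only the Python standard library. The sample values are made up for illustration and are not the actual Iris measurements; the helper names `frequencies` and `most_frequent` are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def frequencies(values):
    """Return {value: fraction of the data set that has this value}."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def most_frequent(values):
    """Return the mode: the value that occurs most often."""
    return Counter(values).most_common(1)[0][0]


# Toy categorical attribute (hypothetical sample).
species = ["setosa", "virginica", "setosa", "versicolor", "setosa"]
print(frequencies(species))
print(most_frequent(species))

# Toy continuous attribute (hypothetical sample).
widths = [0.2, 0.2, 0.4, 1.3, 1.4, 1.5, 1.8, 2.1, 2.3, 2.5]

# quantiles(..., n=100) returns the 99 cut points between percentiles,
# so index p - 1 holds the p-th percentile.
percentiles = quantiles(widths, n=100, method="inclusive")
print("50th percentile:", percentiles[49])
print("90th percentile:", percentiles[89])
```

For a continuous attribute nearly every value is unique, so frequencies carry little information; percentiles describe where the values fall instead.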
The mean is the most common measure of location (centrality), but it is sensitive to outliers.
For that reason, the median or a trimmed mean (the mean after removing extreme values) is also commonly used.
- m is the number of data items x
- r represents the middle of the sorted list of items x with indexing based at 1.
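The location measures above can be sketched as follows. This assumes the usual convention that m = 2r + 1 when m is odd (the median is item r + 1 of the sorted list) and m = 2r when m is even (the median averages items r and r + 1), with 1-based indexing. The function names and the trimming fraction `p` are illustrative choices, not part of the notes.

```python
def mean(x):
    """Arithmetic mean of the m data items."""
    return sum(x) / len(x)


def median(x):
    """Middle value of the sorted items (average of the two middle ones if m is even)."""
    s = sorted(x)
    m = len(s)
    r = m // 2
    if m % 2 == 1:
        # m = 2r + 1: the 1-based item r + 1 is the 0-based index r.
        return s[r]
    # m = 2r: average the 1-based items r and r + 1.
    return (s[r - 1] + s[r]) / 2


def trimmed_mean(x, p):
    """Mean after dropping a fraction p (0 <= p < 0.5) of the items from each end."""
    s = sorted(x)
    k = int(len(s) * p)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)


# Toy data with one outlier (hypothetical values).
data = [1, 2, 3, 4, 100]
print(mean(data), median(data), trimmed_mean(data, 0.2))
```

The single outlier pulls the mean far from the bulk of the data, while the median and the trimmed mean stay near the middle values.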
Range is the difference between the maximum and the minimum: range(x) = max(x) - min(x)
The variance (s_{x}^{2}) or standard deviation (s_{x}) is the most common measure of the spread of a set of points.
Because the variance is also sensitive to outliers, other, more robust measures are often used:
AAD = Average absolute deviation
MAD = Median absolute deviation
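A short sketch of these spread measures with the Python standard library. Two assumptions are made here: the variance is the sample variance (dividing by m - 1, which is what `statistics.variance` computes), and MAD takes deviations from the median. Some texts define MAD with deviations from the mean instead. The data values are made up.

```python
from statistics import mean, median, stdev, variance


def value_range(x):
    """max(x) - min(x)."""
    return max(x) - min(x)


def aad(x):
    """Average absolute deviation from the mean."""
    center = mean(x)
    return sum(abs(v - center) for v in x) / len(x)


def mad(x):
    """Median absolute deviation (here: deviations from the median)."""
    center = median(x)
    return median([abs(v - center) for v in x])


# Toy data with one outlier (hypothetical values).
data = [1, 2, 3, 4, 100]
print("range:", value_range(data))
print("variance:", variance(data))
print("std dev:", stdev(data))
print("AAD:", aad(data))
print("MAD:", mad(data))
```

On this sample the outlier inflates the range, variance, and AAD, while MAD stays small because it depends only on the middle of the sorted deviations.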
Plots are far easier to generate on today's computers than in the earlier days of statistics (e.g., in Tukey's time).
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
Visualization of data is one of the most powerful and appealing techniques for data exploration.
Sea Surface Temperature for July 1982.
Thousands of data points shown in one image
Representation is the mapping of data to a visual format
Data objects, their attributes, and the relationships among data objects are translated into the image elements such as points, lines, shapes, and colors.
Example:
- Objects are often represented as points
- Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
- If position is used, then the relationships among points, i.e., whether they form groups or whether a point is an outlier, are easily perceived.
A histogram usually shows the distribution of values of a single variable
- Divide the values into bins and show a bar plot of the number of objects in each bin.
- The height of each bar indicates the number of objects
- Shape of histogram depends on the number of bins
Example: Iris Petal Width (10 and 20 bins, respectively)
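The binning step can be sketched in a few lines of Python. This assumes equal-width bins spanning the minimum to the maximum value, which is one common choice; the function name `histogram` and the sample widths are hypothetical, not the real Iris data.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins  # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Toy petal widths (hypothetical values).
widths = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.0, 2.3, 2.5]

# The same data look different with a different number of bins.
for bins in (3, 6):
    counts = histogram(widths, bins)
    print(bins, "bins:", counts)
    for c in counts:
        print("  " + "#" * c)  # crude text bar plot
```

Too few bins hide structure such as gaps between groups, while too many bins leave most bars nearly empty.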
Two dimensional histograms show the joint distribution of the values of two attributes.
Example: petal width vs petal length
Box plots are sometimes called Tukey box plots after their inventor, John Tukey. They show the distribution of the data.
Good for comparisons, e.g. the attributes of the iris dataset.
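The numbers that a box plot draws can be computed as below. This is a sketch under one common convention, which is an assumption here since the notes do not specify one: the box spans the first to third quartile, the whiskers reach the most extreme values within 1.5 times the interquartile range of the box, and anything beyond is plotted as an outlier. Other conventions (for example whiskers at the 10th and 90th percentiles) are also in use. The function name and data are hypothetical.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Return the quartiles, whisker ends, and outliers for one attribute."""
    s = sorted(values)
    q1, q2, q3 = quantiles(s, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in s if low_fence <= v <= high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [v for v in s if v < low_fence or v > high_fence],
    }


# Toy attribute with one extreme value (hypothetical data).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
for name, value in box_plot_stats(data).items():
    print(name, value)
```

Computing these statistics per attribute and drawing the boxes side by side gives the comparison view described above.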
In a scatter plot, attribute values determine the position of each point
Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
It is useful to have matrices of scatter plots that can compactly summarize the relationships of several pairs of attributes
Contour plots are useful when a continuous attribute is measured on a spatial grid
They partition the plane into regions of similar values
The contour lines that form the boundaries of these regions connect points with equal values
The most common example is contour maps of elevation
Can also display temperature, rainfall, air pressure, etc.
Matrix plots can display the data matrix directly
This can be useful when objects are sorted according to class
Typically, the attributes are normalized to prevent one attribute from dominating the plot
Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
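The normalization step and a distance matrix can be sketched as follows. Two assumptions: "normalized" is taken to mean standardizing each attribute to mean 0 and standard deviation 1, and distance is Euclidean. Both are common choices but others are possible. The function names and rows are hypothetical.

```python
from math import dist
from statistics import mean, pstdev


def standardize_columns(rows):
    """Rescale every attribute (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]  # assumes no attribute is constant
    return [
        [(v - mu) / sd for v, mu, sd in zip(row, mus, sds)]
        for row in rows
    ]


def distance_matrix(rows):
    """Pairwise Euclidean distances; entry [i][j] compares objects i and j."""
    return [[dist(a, b) for b in rows] for a in rows]


# Toy objects sorted by class; the second attribute has a much larger
# scale than the first (hypothetical values).
rows = [
    [1.4, 200.0],
    [1.5, 210.0],
    [4.5, 900.0],
    [4.7, 950.0],
]

scaled = standardize_columns(rows)
for line in distance_matrix(scaled):
    print(["%.2f" % d for d in line])
```

When the rows are sorted by class, objects of the same class form blocks of small distances along the diagonal, which is the pattern such a plot makes visible.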
Parallel coordinates are used to plot the attribute values of high-dimensional data
Instead of using perpendicular axes, use a set of parallel axes
The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line
Thus, each object is represented as a line
Often, the lines representing a distinct class of objects group together, at least for some attributes
Ordering of attributes is important in seeing such groupings (difference between the two plots below)
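A sketch of how each object becomes a line in parallel coordinates. It assumes every attribute is min-max scaled to [0, 1] so that all axes share one height, which is an illustrative choice; `order` lists the attribute indices from the leftmost axis to the rightmost. The function name and data are hypothetical.

```python
def parallel_coordinates(rows, order):
    """Turn each object into a polyline of (axis position, scaled value) points."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]  # assumes no attribute is constant
    lines = []
    for row in rows:
        line = []
        for position, attr in enumerate(order):
            scaled = (row[attr] - lows[attr]) / (highs[attr] - lows[attr])
            line.append((position, scaled))
        lines.append(line)
    return lines


# Toy objects with three attributes (hypothetical values).
rows = [
    [1.0, 10.0, 5.0],
    [2.0, 30.0, 7.0],
    [3.0, 20.0, 9.0],
]

# The same objects with two different axis orderings.
for order in ([0, 1, 2], [0, 2, 1]):
    print("axis order", order)
    for line in parallel_coordinates(rows, order):
        print("  ", line)
```

Changing `order` only rearranges the axes, but it changes which attributes sit next to each other and therefore how much the lines cross.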
Star plots use a similar approach to parallel coordinates, but the axes radiate from a central point
The line connecting the values of an object is a polygon
Each plot is an example/observation.
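The radial layout can be sketched by placing one axis per attribute at equally spaced angles and putting each value at that distance from the center. This assumes the values have already been scaled to [0, 1]; the function name `star_polygon` and the sample object are hypothetical.

```python
from math import cos, sin, tau


def star_polygon(scaled_values):
    """Return the (x, y) vertices of the polygon for one object.

    Attribute k lies on the axis at angle tau * k / n, at a distance
    from the center equal to its scaled value.
    """
    n = len(scaled_values)
    return [
        (v * cos(tau * k / n), v * sin(tau * k / n))
        for k, v in enumerate(scaled_values)
    ]


# One toy object with four scaled attributes (hypothetical values).
for x, y in star_polygon([1.0, 0.5, 0.75, 0.25]):
    print("%.2f, %.2f" % (x, y))
```

Connecting the vertices in order and closing the path gives the polygon; drawing one polygon per object yields one small plot per observation.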
On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
Relational databases put data into tables (many tables, which are "normalized" to eliminate duplicate data for updating purposes).
OLAP uses a multidimensional matrix/array representation. Usually the data is static; the matrix is only added to.
Converting tabular data into a multidimensional array:
For the iris data, we first discretize the petal width and length into categorical values: low, medium, and high
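A sketch of this conversion: discretize the two petal attributes, then count the objects in each cell of a three-dimensional array indexed by (petal length, petal width, species). The cut points and the sample rows are assumptions chosen for illustration; the notes do not give the thresholds that were actually used.

```python
from collections import Counter


def discretize(value, low_cut, high_cut):
    """Map a number to 'low', 'medium', or 'high' using two cut points."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


# Toy rows of (petal length, petal width, species); hypothetical values.
rows = [
    (1.4, 0.2, "setosa"),
    (1.5, 0.3, "setosa"),
    (4.5, 1.5, "versicolor"),
    (4.7, 1.4, "versicolor"),
    (5.1, 1.6, "versicolor"),
    (6.0, 2.5, "virginica"),
    (5.8, 1.9, "virginica"),
]

# Assumed cut points: length at 2.5 and 5.0, width at 0.75 and 1.75.
cube = Counter(
    (discretize(length, 2.5, 5.0), discretize(width, 0.75, 1.75), species)
    for length, width, species in rows
)

# Each key is one cell of the array; the value is the count in that cell.
for cell, count in sorted(cube.items()):
    print(cell, count)
```

Only non-empty cells are stored here; a dense 3 x 3 x 3 array would hold the same counts with zeros elsewhere.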
An Excel Pivot Table will take the columnar data and treat it as a data cube.
Slice - selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions
Dice - selecting a subset of cells by specifying a range of attribute values
Roll up - combining the data within a group of values (attributes may be hierarchical in structure), reducing the cube to fewer dimensions or a coarser level of detail
Drill down - expanding the details of the cube to a finer level because an attribute has internal structure (e.g., years have months)
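These operations can be sketched on a small cube stored as a dictionary from cell coordinates to counts. The dimension order (petal length, petal width, species), the counts, and the function names are all hypothetical. The roll up shown here aggregates a whole dimension away; rolling up within a hierarchy (months to years) would group values instead of dropping the dimension.

```python
from collections import Counter

# Toy cube: (petal length, petal width, species) -> count (hypothetical).
cube = {
    ("low", "low", "setosa"): 5,
    ("medium", "medium", "versicolor"): 4,
    ("high", "medium", "versicolor"): 1,
    ("high", "high", "virginica"): 5,
}


def slice_cube(cube, dim, value):
    """Keep the cells whose coordinate on dimension dim equals value."""
    return {cell: n for cell, n in cube.items() if cell[dim] == value}


def dice_cube(cube, allowed):
    """Keep the cells whose coordinates fall in the allowed sets.

    allowed maps a dimension index to the set of values to keep.
    """
    return {
        cell: n
        for cell, n in cube.items()
        if all(cell[dim] in values for dim, values in allowed.items())
    }


def roll_up(cube, dim):
    """Sum the counts over dimension dim, leaving one fewer dimension."""
    totals = Counter()
    for cell, n in cube.items():
        totals[cell[:dim] + cell[dim + 1:]] += n
    return dict(totals)


print(slice_cube(cube, 2, "setosa"))
print(dice_cube(cube, {0: {"medium", "high"}, 1: {"medium"}}))
print(roll_up(cube, 2))              # counts by petal length and width
print(roll_up(roll_up(cube, 0), 0))  # counts by species only
```

Drill down is the inverse of roll up and cannot be computed from the aggregated counts alone; it needs the finer-grained data that the aggregate was built from.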