DM 352 Syllabus  DM 552 Syllabus
last updated 6/23/21
There are many ways to represent values visually. We also need to recognize the differing types of values.
There is an abundance of terms from computer science, statistics, and application areas that have similar meanings.
A data set is a collection of objects and their attributes
An attribute is a property or characteristic of an object
A collection of attributes describes an object
Attribute values are numbers or symbols assigned to an attribute for a particular object
Distinction between attributes and attribute values
Different attributes can be mapped to the same set of values
How you measure an attribute may not match the properties of the attribute.
You need to understand your data to be able to determine what type they are.
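As a small illustration of why the storage type alone does not tell you the attribute type, the sketch below codes two attributes as integers. The attribute names and values are hypothetical and chosen only for the example.

```python
from statistics import mean

# Two hypothetical attributes that are both stored as integers.
employee_ids = [1001, 1002, 1003, 1004]
ages = [23, 35, 41, 29]

# Both lists use the same Python type, so the language allows the same operations.
same_storage_type = type(employee_ids[0]) is type(ages[0])

mean_age = mean(ages)          # meaningful: age is a quantity
mean_id = mean(employee_ids)   # computable, but it says nothing about the employees

print(same_storage_type, mean_age, mean_id)
```

Python computes both means without complaint; only knowledge of the data tells you that the second one should not be used.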
Nominal attributes take discrete values, often nonnumeric.
Properties: distinctness (= and != are meaningful)
You can further characterize nominal attributes:
 categorical: a value selected from a finite, usually short, list of possibilities (colors, days of week); can be coded as an enumeration
 ranked: a categorical type with natural ordering (small, medium, large), so = != < > are meaningful
 arbitrary: a value from an infinite range of possibilities with no implied ordering (addresses, names)
Examples include eye color, postal codes, ID numbers, sex (male/female)
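One possible way to code the categorical and ranked sub-types is with Python's standard `enum` module. This is a minimal sketch; the class names `Color` and `Size` are assumptions made for illustration.

```python
from enum import Enum, IntEnum

class Color(Enum):
    """Categorical: the numeric codes are arbitrary labels."""
    RED = 1
    GREEN = 2
    BLUE = 3

class Size(IntEnum):
    """Ranked: the codes follow the natural ordering."""
    SMALL = 1
    MEDIUM = 2
    LARGE = 3

# Categorical values support only equality tests.
same_color = Color.RED == Color.RED
different_color = Color.RED != Color.BLUE

# Ranked values also support order comparisons.
ordered = Size.SMALL < Size.LARGE

# Ordering a plain Enum raises TypeError, which matches the nominal properties.
try:
    Color.RED < Color.BLUE
    color_order_allowed = True
except TypeError:
    color_order_allowed = False

print(same_color, different_color, ordered, color_order_allowed)
```

Choosing `Enum` for categorical values and `IntEnum` for ranked values lets the language enforce which comparisons are meaningful.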
Ordinal attributes are quantitative in nature. A numeric coding does not by itself make an attribute ordinal (quantitative); deciding that takes further analysis.
Properties: distinctness and order (= != < > are meaningful)
You can further characterize ordinal attributes:
 continuous: what are the upper and lower limits, and are the limits inclusive?
 binary: only the values 0 and 1 (true/false, yes/no, etc.); can also be considered nominal/categorical
 discrete (integer or real): are the values separated by a constant value/interval?
 statistical (counts, means, medians, modes, standard deviations): these arise from ordinal data
Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height {3=tall, 2=medium, 1=short}, street numbers
Interval attributes: data values are separated by fixed amount(s).
Properties: distinctness, order and differences (= != < > + - are meaningful)
Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio attributes have a true zero, so ratios of values are meaningful.
Properties: distinctness, order, differences and ratios (= != < > + - * / are meaningful)
Examples: temperature in Kelvin, length, time, counts, mass
Is it physically meaningful to say that a temperature of 10° is twice that of 5° on the Celsius, Fahrenheit, or Kelvin scale?
Consider measuring the height above average
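One way to explore the temperature question is to take the same two readings, convert them to Kelvin, and compare the ratios. A minimal sketch; the helper name `celsius_to_kelvin` is an assumption made for illustration.

```python
def celsius_to_kelvin(c: float) -> float:
    """Shift a Celsius reading onto the Kelvin scale (absolute zero = 0 K)."""
    return c + 273.15

warm_c, cool_c = 10.0, 5.0

# Ratio taken on the Celsius scale, whose zero point is arbitrary.
ratio_celsius = warm_c / cool_c

# Ratio of the same two temperatures on the Kelvin scale.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences agree on both scales because the unit size is the same.
diff_celsius = warm_c - cool_c
diff_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(ratio_celsius)            # 2.0
print(round(ratio_kelvin, 3))   # 1.018
print(diff_celsius, round(diff_kelvin, 6))
```

The ratio of 2 appears only on the Celsius scale, so it reflects the choice of zero point rather than a physical fact, while the 5-degree difference survives the change of scale. Height above average behaves the same way, because "average" is also an arbitrary zero.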
| Attribute Type | Transformation | Comments |
| --- | --- | --- |
| Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference? |
| Ordinal | An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}. |
| Interval | new_value = a * old_value + b where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree). |
| Ratio | new_value = a * old_value | Length can be measured in meters or feet. |
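The transformations in the table can be checked numerically. The sketch below is illustrative only: the sample values, the helper `is_order_preserved`, and the constants are assumptions, and it tests one example per row rather than proving the general claim.

```python
values = [1.0, 2.0, 3.0]   # e.g. good, better, best coded as 1, 2, 3

def is_order_preserved(old, new):
    """True if sorting by old values and by new values gives the same ranking."""
    rank_old = sorted(range(len(old)), key=lambda i: old[i])
    rank_new = sorted(range(len(new)), key=lambda i: new[i])
    return rank_old == rank_new

# Ordinal: an increasing recoding keeps the order.
recoded = [0.5, 1.0, 10.0]
ordinal_ok = is_order_preserved(values, recoded)

# Interval: new = a * old + b keeps ratios of differences, not ratios of values.
a, b = 1.8, 32.0           # Celsius to Fahrenheit
interval = [a * v + b for v in values]
diff_ratio_old = (values[2] - values[1]) / (values[1] - values[0])
diff_ratio_new = (interval[2] - interval[1]) / (interval[1] - interval[0])
value_ratio_after_shift = interval[2] / interval[0]

# Ratio: new = a * old keeps ratios of the values themselves.
feet_per_meter = 3.28084
scaled = [feet_per_meter * v for v in values]
value_ratio_old = values[2] / values[0]
value_ratio_new = scaled[2] / scaled[0]

print(ordinal_ok, diff_ratio_old, diff_ratio_new, value_ratio_old, value_ratio_new)
```

Each attribute type tolerates a narrower family of recodings than the one above it, which is why a ratio survives a change of units but not a change of zero point.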
For asymmetric attributes, only presence (a nonzero attribute value) is regarded as important; absence carries little information.
 Words present in documents
 Items present in customer transactions
If we met a friend in the grocery store would we ever say the following? “I see our purchases are very similar since we didn’t buy most of the same things.”
We need two asymmetric binary attributes to represent one ordinary binary attribute
 Association analysis uses asymmetric attributes
Asymmetric attributes typically arise from objects that are sets
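The grocery store remark can be made concrete by comparing two similarity measures on a pair of shopping baskets. The notes do not prescribe either measure; simple matching and Jaccard are used here as common choices, and the catalogue and baskets are hypothetical.

```python
def to_binary_vector(items, vocabulary):
    """Code a set as one asymmetric binary attribute per vocabulary entry."""
    return [1 if v in items else 0 for v in vocabulary]

def simple_matching(x, y):
    """Fraction of attributes that agree, counting 0-0 matches."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):
    """Fraction of attributes that agree, ignoring 0-0 matches."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 0.0

# Hypothetical store catalogue and two baskets with nothing in common.
catalogue = ["bread", "milk", "eggs", "tea", "rice", "soap", "salt", "jam"]
basket_a = {"bread", "milk"}
basket_b = {"tea", "soap"}

xa = to_binary_vector(basket_a, catalogue)
xb = to_binary_vector(basket_b, catalogue)

smc = simple_matching(xa, xb)
jac = jaccard(xa, xb)

print(smc)  # 0.5
print(jac)  # 0.0
```

Simple matching rewards the two baskets for the four items neither shopper bought, while the Jaccard measure ignores shared absences and reports no similarity. With a realistic catalogue of thousands of items, the simple matching score would approach 1 for almost any pair of baskets.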
Geometry or spatial: contains 2 or 3 values (lat, long, alt) that together may be treated as a single dimension. Some mathematical geometries can be considered: Cartesian or spherical (GIS).
Timestamp or temporal, chronological types.
Topology or relationship connectivity.
Incompleteness
Real data is approximate and noisy
The types of operations you choose should be “meaningful” for the type of data you have
Data sets are organized typically as sets of records, where a record represents a data observation or data "point".
Records with one value are univariate; with two values, bivariate; with three, trivariate; with more, hypervariate.
A value itself may have structure:
Record
Graph
Ordered
The number of values per record/observation is its dimension. Dimension should be consistent across all records of a data set.
A town center on a map has latitude, longitude, altitude, square miles, population, name, postal code, state, country as elements of its record. Its dimension, in this example, is 9.
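The town center record can be sketched as a Python dataclass, with the dimension read off as the number of fields. The class name `TownCenter` and the sample values are hypothetical.

```python
from dataclasses import dataclass, fields

@dataclass
class TownCenter:
    latitude: float
    longitude: float
    altitude: float
    square_miles: float
    population: int
    name: str
    postal_code: str
    state: str
    country: str

# Placeholder values, for illustration only.
town = TownCenter(0.0, 0.0, 0.0, 1.0, 1000, "Exampleville", "00000", "XX", "Nowhere")

# The dimension is the number of values each record carries.
dimension = len(fields(town))
print(dimension)  # 9
```

Because every instance of the class has the same fields, the dimension is automatically consistent across all records.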
A high dimension can pose challenges
Only presence counts.
And then there's the issue of missing data
Patterns depend on the scale.
Have the data already been aggregated?
may also drive type of analysis
What is the dimension?
What are the types of the attributes?
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
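A data matrix of this kind can be sketched with plain Python lists. The attribute names and numbers below are made up for the example.

```python
# Each object is described by the same three numeric attributes (assumed names).
attributes = ["height_cm", "weight_kg", "age_years"]
objects = [
    [170.0, 65.0, 30.0],
    [182.5, 80.2, 41.0],
    [158.3, 52.7, 25.0],
    [175.1, 71.4, 36.0],
]

m = len(objects)       # number of rows, one per object
n = len(objects[0])    # number of columns, one per attribute

# A well-formed data matrix has the same number of columns in every row.
is_rectangular = all(len(row) == n for row in objects)

# Column j holds every object's value for attribute j.
weights = [row[attributes.index("weight_kg")] for row in objects]

print(m, n)  # 4 3
```

Each row is one point in a 3-dimensional space, and each column is one axis of that space.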
A better example is rainfall data
Each document becomes a ‘term’ vector
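A term vector can be sketched as follows, assuming the common convention that each component counts how often one vocabulary term occurs in the document. The two sample documents and the whitespace tokenization are simplifications chosen for the example.

```python
from collections import Counter

documents = [
    "the team won the game",
    "the coach lost the ball",
]

# Shared vocabulary: every distinct term, in a fixed (sorted) order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def term_vector(doc, vocab):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [term_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Every document maps to a vector of the same length, so the collection forms a data matrix with one row per document and one column per term. Most entries are zero for a realistic vocabulary, which is where asymmetric attributes come in.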
A special type of record data, where
General directed graph, a molecule
Webpage connections
Genomic sequence
Average Monthly Temperature of land and ocean
| General category | Specific type | Description | Examples | Coding |
| --- | --- | --- | --- | --- |
| Nominal ("nonnumeric", discrete values) | Categorical | a value selected from a finite, usually short, list of possibilities | color, days of week | enumeration or arbitrary numbers; only equality tests are sensible |
| Nominal | Ranked | a categorical type with an implied ordering (can be converted to ordinal and ordinal can be converted to ranked nominal) | small, medium, large | numbers, according to the order |
| Nominal | Arbitrary | a value from an infinite range of possibilities with no implied ordering | addresses, names | no coding possible; only equality |
| Binary | Boolean | two distinct categories | yes/no, true/false | 0/1 |
| Ordinal ("numeric"; interval?/ratio?) | Continuous | any real value between upper and lower limits | weights, lengths | typically a float variable type |
| Ordinal | Discrete | values separated by a constant value (1, 10, 0.5) | counts | typically an integer variable type |
| Ordinal | Statistical | values calculated from a set of ordinal values | counts, means, medians, modes, st.dev. | typically float, counts may be integer |
| Spatial | Geographical | location on a map or plane or 3D space | longitude, latitude | pairs of values |
| Temporal | Chronological | times, dates, numeric sequences | birthdates, daily, hourly observations | integers, time, float |
| Topological | Connectivity (Relational) | relationship mappings | hierarchies, graphs, digraphs | foreign keys, cross-referencing values |