Introduction

DM 352 Syllabus | DM 552 Syllabus

last updated 23-Jun-2021

Course context

Course title "Machine Learning" vs text title "Data Mining"

We will not concern ourselves with the distinction between the two.

All four authors of the text are male and, presumably, of mixed ethnic backgrounds. We acknowledge that the field lacks balanced representation.

Are there biases in the book?

Prerequisites:

Why Data Mining?

There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies

Cheap digital storage options with immense processing power

New mantra

Expectations

Commercial viewpoint: Lots of data is being collected and warehoused

Competitive pressure is strong

Scientific viewpoint: Data is collected in real time and stored at enormous speeds

Data mining helps scientists

 

Great opportunities to improve productivity in all walks of life

The V's of Big Data

Great Opportunities to Solve Society’s Major Problems

One final thought about big data

The volume of data we collect may soon grow too large to manage, making it impossible to store all of it.

Just a reminder of magnitudes

Terabyte = trillion bytes (10^12)

Petabyte = million billion bytes (10^15)

Exabyte = million trillion, or billion billion, bytes (10^18)

Zettabyte = billion trillion bytes (10^21)

Yottabyte = trillion trillion bytes (10^24)
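The verbal magnitudes above can be checked with a little arithmetic. The sketch below assumes the decimal (SI) meaning of each prefix, not the binary (1024-based) one; the names `PREFIX_EXPONENTS` and `bytes_in` are made up for illustration.

```python
# Decimal (SI) byte prefixes expressed as powers of ten.
# Assumption: SI prefixes (powers of 1000), not binary ones (powers of 1024).
PREFIX_EXPONENTS = {
    "terabyte": 12,
    "petabyte": 15,
    "exabyte": 18,
    "zettabyte": 21,
    "yottabyte": 24,
}

MILLION, BILLION, TRILLION = 10**6, 10**9, 10**12


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one unit of the given decimal size."""
    return 10 ** PREFIX_EXPONENTS[unit]


# Print each unit with its exponent and the full byte count.
for unit, exponent in PREFIX_EXPONENTS.items():
    print(f"1 {unit} = 10^{exponent} = {bytes_in(unit):,} bytes")
```

Each step up the list multiplies the previous unit by 1,000.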


What is Data Mining?

Many definitions (each textbook will define its own)

Examples:

The process

The knowledge discovery in databases (KDD)


Bias considerations

As we progress through the class, we will take the time to consider societal issues such as bias and racism whenever they arise.

Our text does not address biases in the algorithms, so we need to keep this question at the forefront. You are welcome to offer comments and to point out biases that you identify.

Are the algorithms biased?

Do the modeled results have more to do with the data used to train the models?


Topics Overview


Data Mining Tasks

Prediction Methods (Supervised)

Description Methods (Unsupervised)

Here are several tasks applied to the same data set:



Predictive Modeling (Supervised): Classification

How is the model created? Start by splitting the data into training and test sets.
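The splitting step can be sketched as follows. This is a minimal illustration, assuming a simple random holdout split; the function name `train_test_split`, the 25% test fraction, and the toy records are all made up for the example and are not taken from the text.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled records: (attribute value, class label).
records = [(i, "yes" if i % 2 else "no") for i in range(20)]
train, test = train_test_split(records, test_fraction=0.25, seed=42)
print(len(train), "training records,", len(test), "test records")
```

The model is then fit using only the training records, and the held-out test records are used to estimate how well it classifies data it has not seen.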

Examples:

 

 


Regression (supervised)

Regression has been studied extensively in statistics. We won't spend much time on it here; we assume you are already familiar with it or will have ample opportunity to learn it elsewhere.

Task: Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
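For the linear case, a minimal sketch of this task with a single predictor is shown below. It assumes ordinary least squares as the fitting method, which the notes do not prescribe; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Predict the continuous target for a new value of the predictor.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```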

Examples:

Although regression is numeric in nature, categorical data can still be used in a regression model once it has been numerically coded and/or binarized.
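One common way to binarize a categorical attribute is one-hot (dummy) encoding, sketched below. The helper name `one_hot` and the color values are hypothetical; this is an illustration of the idea, not a method taken from the text.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical attribute.
colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)

for color, row in zip(colors, encoded):
    print(color, row)
```

When the regression model also includes an intercept, one of the indicator columns is usually dropped, because the full set of columns always sums to 1 and is therefore redundant with the intercept.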


Cluster Analysis (Unsupervised)

Task: Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
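As one concrete instance of this task, the sketch below uses k-means, which is only one of many clustering algorithms and is chosen here purely for illustration. It assumes Euclidean distance as the similarity measure; the function name `kmeans`, the choice of k = 2, and the points are made up.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (labels, centroids) for a list of point tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids


# Two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Points in the same group end up with the same label. Which group is called 0 and which is called 1 depends on the random starting centroids.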

Examples

Market Segmentation:

Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

Approach:

Document Clustering:

Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

Approach:


Association Rule Discovery (Unsupervised)

Task: Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.

Applications

Market-basket analysis: Rules are used for sales promotion, shelf management, and inventory management

Telecommunication alarm diagnosis: Rules are used to find combinations of alarms that occur together frequently in the same time period

Medical Informatics: Rules are used to find combinations of patient symptoms and test results associated with certain diseases
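For the market-basket case, the strength of a candidate rule is commonly measured by its support and confidence. The sketch below computes both for a single rule over a handful of made-up transactions; the function names and the items are illustrative assumptions, and no rule search (such as Apriori) is attempted.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return (count_containing(transactions, both)
            / count_containing(transactions, antecedent))


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# Evaluate the rule {diapers} -> {beer}.
rule_support = support(transactions, {"diapers", "beer"})
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

Here three of the five baskets contain both diapers and beer, and three of the four baskets with diapers also contain beer.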

 

An Example:

Subspace Differential Coexpression Pattern from lung cancer dataset


Deviation/Anomaly/Change Detection

Detect significant deviations from normal behavior
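One simple way to make "significant deviation" concrete is a z-score rule: flag any value that lies more than some number of standard deviations from the mean. The sketch below assumes a single numeric attribute; the function name, the threshold of 2.5, and the readings are illustrative choices, not taken from the text.

```python
import statistics


def zscore_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:  # all values identical, so nothing deviates
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical sensor readings with one unusual value.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 10.0, 25.0]
anomalies = zscore_outliers(readings)
print(anomalies)
```

The threshold needs care on small samples: an extreme value inflates the standard deviation it is measured against, so with only ten readings a single outlier cannot reach a z-score of 3.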

Applications:


Motivating Challenges

Scalability

High Dimensionality

Heterogeneous and Complex Data

Data Ownership and Distribution

Non-traditional Analysis

 

What are machine learning (ML) and deep learning, and what are their applications?

Full scope of data science (graphic below from https://www.datasciencecentral.com/profiles/blogs/machine-learning-can-we-please-just-agree-what-this-means)

Deep Learning is Different from Traditional Predictive Analytics (also from the URL above)

We won't spend a lot of time on neural networks (NN), but we do want to give you a flavor of what is happening in that area.