DATA MINING
DS 552  (3 credits)

FALL 2021 SYLLABUS

COURSE LINK TO MOODLE



Page last updated 08/18/2021


Course dates:
On-line Aug. 26 - Dec. 17, 2021
Bootcamp Saturday, Aug. 28, 2021 9:30 am - 3:00 pm

Instructor:

Loren Rhodes
E-mail: rhodes@juniata.edu  (preferred first point of contact)
Office location: Brumbaugh Academic Center, C203
Office: 814-641-3620

Cell: 814-644-3309
(texts are fine)

Office hours are regularly held on Monday and Wednesday evenings. See Moodle for exact times and the Zoom link.

Moodle

Moodle is the course management system for this course. It will be used for access to materials, assignment and project submissions with their timing and deadlines, and grade posting.

Make sure you log into Moodle several times each week.

Required text and resources:

Alternative data mining and machine learning tools downloads or links:


Course description:

This course considers the use of machine learning (ML) and data mining (DM) algorithms by the data scientist to discover information embedded in wide-ranging datasets, from simple tables to complex data sets and big data situations. Topics include ML and DM techniques such as classification, clustering, and predictive and statistical modeling, using tools such as R, Python, Matlab, Weka and others.
Prerequisite: DS 500, DS 510 or by permission

Objectives:

This course considers the organization of data and gives an overview of the current techniques, algorithms and tools for mining information from it.

Students will build skills and/or gain understanding in:

 


 


Grading:

10% Weekly quizzes, on-line in Moodle

25% Assignments

25% Late Midterm (10th week)

Below are links to example paper exams from the undergraduate DS 352 (Fall '18) course. Use them to get a sense of the form and style of testing; they are not necessarily a study guide.

40% Data mining project

Identify an existing, substantial data set that can be used to demonstrate the data mining techniques covered in class. The data set must meet the size criteria outlined in the detailed description. You will apply the data mining tools and techniques covered in class to the data set for knowledge discovery and classification, present the results of the project during the last two weeks of the semester, and turn in a written project in lieu of a final exam.

Detailed Project Description

 

 


Course Policies

My standard policies across all of my courses on attendance, late assignments, academic integrity, etc., are described on my Course Policies web page. Please read them carefully.

Accessibility Policy

Juniata is committed to providing equitable access for student learning. To arrange for an accommodation based on a documented medical condition, mental health condition or learning disability (or if you suspect you have one), please contact Patty Klug, Director of Student Accessibility Services, by emailing her at klugp@juniata.edu or calling 814-641-5840. I encourage you to confirm that I have received a copy of your accommodation letter and to schedule a time for us to meet to discuss your needs. It is best to submit accommodation requests before the semester begins, although requests can be made at any time during the semester.

 

Further details are found in Rhodes's course policies page.



DATA MINING
FALL 2021
TENTATIVE COURSE OUTLINE

The topic sequence and readings are based on the text Introduction to Data Mining, 2nd ed., by Tan et al.

https://www-users.cs.umn.edu/~kumar001/dmbook/index.php#item4

Please note that videos of notes and lectures are found on the Moodle page.

Week | Date | Topics and Readings | Links for Study and Lectures | Exercises and Project Notes

Week 1

Course Introduction

Course overview

Ch 1

Introduction

Review Ch 1 ex: 1,3

See Getting Started with Python

 

Data Exploration Review
(1st edition book chapter)

Data types:
Ch 2.1

Exploration

Data types

Exercises for this review are linked here

Python tutorial for data exploration: understand the code you are using. You will apply these techniques to your chosen dataset. The Python script for Data Exploration is here.

Data types exercises: Ch2 ex: 2,3,4,5,7,9

Week 2

 

 

 

Data quality:
Ch 2.2

Data preprocessing:
Ch 2.3

Data: measures of similarity:
Ch 2.4

Data preprocessing overview

 

 

 

 

Similarities and Dissimilarities

 

 

Data quality exercises: Ch2 ex: 11,12

Identification of a dataset for analysis; preprocessing of the data

Python tutorial for data preprocessing: understand the options and techniques. (Script is here. The pic folder is here as a zip file.) Apply these to prepare your dataset as needed. Write up what you can do or still need to do.

Application of tutorial4 to a dataset

Apply the Python techniques for data exploration to your dataset. Write up your results.

Ch2 ex: 13,14,16,18,19 (reviewed in class)

What similarities and dissimilarities can you find with your dataset? Write up your observations into your project.

 

Week 3

 

 

Classification

Decision trees

Ch 3.1-3.3

 

Classification

Python implementation of the similarity measures. The exercise is linked here to load into Jupyter.  Upload your solution to Moodle.

Ch3 ex: 2, 3, 7, 8abc

Python tutorial on classification (sections 6.1, 3.2, 3.4.1). Script is here. Apply techniques to your project dataset. Write up your observations into your project.

 

 

Week 4

 

Model overfitting

Model selection and evaluation

Ch 3.4-3.8

Overfitting

Python tutorial on overfitting (section 3.3 from above tutorial)

Classification Homework

 

 

Week 5

 

Alternative classification techniques

Ch 4.1-4.3

Rule-based Classifier

Linear Regression (simple review)

Nearest Neighbor Classifier

Ch4 ex: 1, 3, 4ab, 17

Apply techniques to project dataset in Weka. Write up your observations into your project.

Optional: Python tutorial for regression. Script is here. Your dataset may lend itself to regression modeling, so you should attempt some simple regression models. Write up your observations into your project, or explain why regression doesn't apply.

 

Week 6

 

Ch 5.1-5.4 (maybe not 5.4)

Association analysis

Association analysis advanced concepts

5.7.3 Simpson's paradox

Ch 6.1-6.4.2, 6.5

Association Rule generation

Additional Assoc. Rule topics

Association Rule Advanced Topics

Ch5 ex: 2a-d,6a-d, 7a, 8, 9ab, 10, 11a, 13a, 14 (you will do variations on #2 and #8 for the homework below)

Ch6 ex: 1, 2, 5, 6, 9, 10, 11, 12, 13ab, 14 (you will do variations on #1,#2 and #5)

Association Rules Homework

Apply techniques to your project dataset in Weka. Write up your observations into your project.

Week 7

Clustering Analysis

K-means

Ch 7

Sequence Pattern Mining

Subgraph mining (not considered)

 

Ch7 ex: 2, 5, 6ab, 7

 

Week 8

 

Clustering and Hierarchical clustering

Ch 7

Clustering Analysis

Hierarchical clustering

Python tutorial on cluster analysis. Script is here. Apply techniques to your project dataset. Write up your observations into your project.

Week 9  

 

DBSCAN Clustering

Ch 7.4 and Ch 8.3

DBSCAN

 

Clustering homework

Ch7 ex: 13, 16, 20

Week 10

 

Midterm    

Week 11

 

Cluster Validity

Ch 7.5

Anomaly Detection

Ch 9

Cluster Validity

Anomaly/Outlier Detection

Anomaly homework (adapted from tutorial below)

Ch9 ex: 1,2,3

Week 12

 

Neural Nets

Ch 4.5-4.9

Artificial Neural Networks

 

 

See Python tutorial for Anomaly Detection. Script is here. Apply techniques to your project dataset. Write up your observations into your project.

Week 13

 

 

Ensemble methods
Ch 4.10

Class imbalance problem
Ch 4.11

Avoiding false discoveries

Ch 10

Ensemble Methods

Class Imbalance

Neural Net project

 

 

Week 14

 

Final project submission, presentation and reactions

 

 

 

 

       
