DS 352
Fall 2020 SYLLABUS

Instructor: Loren Rhodes
Office: C-208 BAC, 814-641-3620 
Email: rhodes@juniata.edu
Cell: 814-644-3309

Rhodes Office Hours: see office door for last-minute changes; other times by appointment.

Top of Page | Course Materials | Course Policies | Grading & Objectives | Lecture Outline

Page last updated 11/08/2020

Meeting times: MWF 12:00-12:50 in P107/C229

Course description: This course considers the use of machine learning (ML) and data mining (DM) algorithms for the data scientist to discover information embedded in datasets ranging from simple tables through complex and big data sets. Topics include ML and DM techniques such as classification, clustering, and predictive and statistical modeling, using tools such as R, Python, Matlab, Weka and others. Simple visualization and data exploration will be reviewed in support of DM. Software techniques implemented on emerging storage and hardware structures are introduced for handling big data.
Prerequisite: CS 110, DS 110, and an approved statistics course: MA 220, BI 305, PY 214 or EB 211
Fulfills: N requirements


In a nutshell, this course considers the organization of data and the current techniques, algorithms and tools for mining information from these sources.

Students will build skills and/or gain understanding in:



Required text and resources:

Data mining and machine learning tools downloads or links:

Text and data links:


Exams 45%: three in-class exams at 15% each

Below are links to exams from DS 352 (Fall '18) and IM 241 Information Discovery. The IM 241 exams are given only as examples of the instructor's exam style and topic depth. They cover some data mining topics but mostly visualization topics. Please don't rely on the IM 241 exams as study guides.

         Fall '18 (DS 352)    Fall '14 (IM 241)    Fall '13 (IM 241)
Exam 1   Test -- Key          Test -- Key          Test -- Key
Exam 2   Test -- Key          Test -- no key       Test -- Key
Exam 3   Test -- no key       Test -- no key       Test -- no key

Homework and active class participation 20%

Data mining project 35%

Identify an existing, substantial data set that can be used to demonstrate the data mining techniques covered in class. The data set must meet size criteria as outlined in the detailed description. You will apply the data mining tools and techniques covered in class to the data set for knowledge discovery and classification. You will present the results of the project during the last two weeks of the semester in December and turn in a written project during the final period.

Detailed Project Description

Course Policies

My standard policies across all of my courses on attendance, late assignments, academic integrity, etc., are described on my Course Policies web page. Please read them carefully.

Accessibility Policy

Juniata College is committed to providing equitable access for learning opportunities to students. If you are affiliated with the Student Accessibility Office and have been determined eligible to receive accommodations, I encourage you to confirm that I have received a copy of your accommodation letter and schedule a time for us to meet to discuss your needs in this course. Although it is preferable to request accommodations before the semester begins, requests can be made at any time; they are not retroactive. Any student who feels they may need an accommodation based on a documented medical condition, mental health condition, or learning disability (or suspects they may have one) is encouraged to contact Patty Klug, Director of Student Accessibility Services, at klugp@juniata.edu or 814-641-5840. Her office is located in Founders Hall, office #213.


Further details are found in Rhodes's course policies page.


FALL 2020

The topic sequence and readings are based on the text Introduction to Data Mining, 2nd ed., by Tan et al.


Please note that videos of notes and Zoom captures of lectures are found on the Moodle page.

Topics and Readings | Links for Study and Lectures | Exercises and Project Notes

Week 1

Course Introduction

Course overview

Ch 1


Class discussion Ch 1 ex: 1,3

See Getting Started with Python

Also download and install the academic version of RapidMiner



Data Exploration Review
(1st edition book chapter)

Exercises for this review are linked here

Python tutorial for data exploration: understand the code you are using. You will apply these techniques to your chosen dataset (see 8/26). The Python script for Data Exploration is here.

Work through the first several RapidMiner tutorials.


Data types:
Ch 2.1

Data types

Data types exercises: Ch2 ex: 2,3,4,5,7,9 (reviewed in class)

Week 2


Data quality:
Ch 2.2
Data preprocessing overview



Data quality exercises: Ch2 ex: 11,12 (reviewed in class)





Data preprocessing:
Ch 2.3

Identification of a dataset for analysis; preprocessing of the data

Python tutorial for data preprocessing: understand the options and techniques. Script is here. (The pic folder is here as a zip file.) Apply these to prepare your dataset as needed. Write up what you can do or still need to do.

Application of tutorial4 to a dataset

Apply the Python techniques for data exploration to your dataset. Write up your results.



Week 3


Data: measures of similarity:
Ch 2.4

Similarities and Dissimilarities

Ch2 ex: 13,14,16,18,19 (reviewed in class)

What similarities and dissimilarities can you find with your dataset? Write up your observations into your project.




Week 4




Decision trees

Ch 3.1-3.3



Python implementation of the similarity measures. The exercise is linked here to load into Jupyter. Upload your solution to Moodle.

Ch3 ex: 2, 3, 7, 8abc

Python tutorial on classification (sections 6.1, 3.2, 3.4.1). Script is here. Apply techniques to your project dataset. Write up your observations into your project.



Model overfitting

Model selection and evaluation

Ch 3.4-3.8


Python tutorial on overfitting (section 3.3 from above tutorial)


Week 5

9/14 Exam 1


Catch up and review



Classification Homework


Week 6


Alternative classification techniques

Ch 4.1-4.3

Rule-based Classifier


Ch4 ex: 1, 3, 4ab, 17

Apply techniques to project dataset in Weka. Write up your observations into your project.





Association analysis


Linear Regression (simple review)



Optional: Python tutorial for regression. Script is here. Your dataset may lend itself to regression modeling. You should attempt some simple regression models. Write up your observations into your project, or explain why regression doesn't apply.

Week 7




Ch 5.1-5.4 (maybe not 5.4)

Project presentations

Nearest Neighbor Classifier

Association Rule generation


Week 8




Association analysis advanced concepts

5.7.3 Simpson's paradox

Ch 6.1-6.4.2, 6.5

Additional Assoc. Rule topics

Association Rule Advanced Topics

Ch5 ex: 2a-d, 6a-d, 7a, 8, 9ab, 10, 11a, 13a, 14 (you will do variations on #2 and #8 for the homework below)

Ch6 ex: 1, 2, 5, 6, 9, 10, 11, 12, 13ab, 14 (you will do variations on #1, #2 and #5)

Association Rules Homework

Apply techniques to your project dataset in Weka. Write up your observations into your project.

Week 9  



10/16 Exam 2


Clustering Analysis


Ch 7


Sequence Pattern Mining

Subgraph mining (not considered)

Clustering Analysis

Ch7 ex: 2, 5, 6ab, 7

Python tutorial on cluster analysis. Script is here. Apply techniques to your project dataset. Write up your observations into your project.

Week 10




Hierarchical clustering


Ch 7

Hierarchical clustering


Clustering homework

Ch7 ex: 13, 16, 20

Week 11




DBSCAN Clustering

Ch 7


Cluster Validity



Week 12




Anomaly Detection

Ch 9

Neural Nets

Ch 4.5-4.9

Anomaly/Outlier Detection



Artificial Neural Networks


Anomaly homework (adapted from tutorial below)

Ch9 ex: 1,2,3

See Python tutorial for Anomaly Detection. Script is here. Apply techniques to your project dataset. Write up your observations into your project.

Week 13




Ch 4.5-4.9



Neural Net project




Week 14


Ensemble methods
Ch 4.10

Class imbalance problem
Ch 4.11

Avoiding false discoveries
Ch 10

Ensemble Methods

Class Imbalance





11/18 Exam 3


Week 15


Project presentations

Final Project due Wednesday 12/2 at 9 a.m. No late exceptions.