Overall Project Timeline and Details

Work in pairs must be approved

Updated 21-oct-20

Abstract of project

Identify an existing, substantial data set that can be used to demonstrate many of the data mining techniques that we will cover in class. The data set must meet the size criteria outlined in the detailed description. You will apply existing data mining tools and techniques to the data set for information discovery, present an online, interactive display of the project in the last week of the semester, and turn in a written project during the final period, in lieu of a final.


I. Project Proposal--Due Friday, Aug. 28, 5 p.m.

Objectives of the proposal

Your proposal should include the elements below as sections of your document.

Search for an existing data set to which you can apply numerous machine learning algorithms.

Do not use any sample data sets from this textbook or other texts. These are too small to provide a realistic data mining experience.

You don't need to propose any hypotheses, although the data set should have one or more fairly obvious dependent variables.

It is important that you have access to a substantial body of data so that you can fully appreciate the design needs. "Substantial" means that there could be hundreds of thousands of records or data points in the domain. What you actually work with could be much smaller, however.

Your dataset should have at least 200 rows (observations or data points) and 8-15 columns (attributes or descriptors). A variety of data types is preferred: at least continuous ordinal, discrete ordinal, nominal, and categorical. Geographical or temporal data would be good but is not required. A data set with an excessively large number of rows may need to be randomly sampled to make it small enough for some tools to function, but having that much data is a good thing.
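
As an illustration of the size check and random sampling described above, here is a minimal sketch using pandas. The helper names (meets_size_criteria, sample_rows), the 5,000-row cap, and the synthetic data frame are assumptions made for this example only; the 200-row and 8-column thresholds come from the requirements above.

```python
import pandas as pd

MIN_ROWS = 200  # minimum row count from the project description
MIN_COLS = 8    # lower end of the 8-15 column range


def meets_size_criteria(df: pd.DataFrame) -> bool:
    """Return True if the frame has at least MIN_ROWS rows and MIN_COLS columns."""
    n_rows, n_cols = df.shape
    return n_rows >= MIN_ROWS and n_cols >= MIN_COLS


def sample_rows(df: pd.DataFrame, max_rows: int, seed: int = 0) -> pd.DataFrame:
    """Randomly sample down to max_rows rows; return df unchanged if it is already small enough."""
    if len(df) <= max_rows:
        return df
    # A fixed random_state makes the sample reproducible across runs.
    return df.sample(n=max_rows, random_state=seed)


# Synthetic stand-in for a large data set (not real data).
big = pd.DataFrame({f"col{i}": list(range(100_000)) for i in range(10)})
small = sample_rows(big, max_rows=5_000)
print(small.shape, meets_size_criteria(small))
```

Fixing the seed means the same sample can be regenerated later, which helps when results from different tools need to be compared on identical rows.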

You may consider combining data sources to create enough attributes. For example, if your data set had daily or weekly observations, you might join weather data to augment your data set.
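
The weather example above could be sketched with pandas as follows. The column names and values are made up for illustration, and a left join on a shared date column is only one reasonable way to combine sources.

```python
import pandas as pd

# Hypothetical daily observations; the values are invented for this example.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2020-08-01", "2020-08-02", "2020-08-03"]),
    "units_sold": [120, 95, 143],
})

# Hypothetical daily weather; note that 2020-08-03 is missing.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-08-01", "2020-08-02"]),
    "high_temp_f": [88, 91],
    "precip_in": [0.0, 0.4],
})

# A left join keeps every observation even when no weather row matches.
# indicator=True adds a _merge column that records which rows matched.
combined = sales.merge(weather, on="date", how="left", indicator=True)
unmatched = int((combined["_merge"] == "left_only").sum())

print(combined)
print(f"{unmatched} row(s) had no weather match")
```

Checking the unmatched count after a join is a quick way to catch date-format or key mismatches between sources before they show up as missing values later.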


A. Dataset Narrative.

  1. Give a title to your project.
  2. Provide a summary description of the data set.
  3. While no hypotheses are really necessary, describe what you expect to find.

B. Audience.

  1. Who is the audience? Who would be interested in your results?
  2. What are their likely demographics? While we assume the audience is, for the most part, drawn from the general public, you should describe as many characteristics as you can of the likely readers of your report/presentation. Are there any implications for the use of the data set by your particular audience?
  3. Why is this data set interesting personally?

C. Data sources

  1. List the URLs of the web sites or databases for your data set sources, along with any other necessary access information. Describe how you located them. Include the file formats (CSV, JSON, etc.).
  2. In a data dictionary table, describe each column (or group of similar columns).

Show snippets (5-10 rows and most, if not all, columns) of the data to complement your descriptions.
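
One possible way to start a data dictionary table and a data snippet with pandas is sketched below. The data_dictionary helper and the sample data are hypothetical, and the description column is left blank because it has to be written by hand.

```python
import pandas as pd


def data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Build one row per column: dtype, missing count, distinct count, and an example value."""
    rows = []
    for name in df.columns:
        col = df[name]
        non_null = col.dropna()
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "n_missing": int(col.isna().sum()),
            "n_unique": int(col.nunique()),  # distinct non-missing values
            "example": non_null.iloc[0] if len(non_null) else None,
            "description": "",  # to be filled in by hand
        })
    return pd.DataFrame(rows)


# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "region": ["north", "south", None, "north"],
    "population": [950_000, 230_000, 410_000, 950_000],
    "coastal": [False, False, True, False],
})

dd = data_dictionary(df)
print(dd.to_string(index=False))

# A snippet of the first rows to accompany the descriptions.
print(df.head(10).to_string(index=False))
```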


Post the Word, PDF, or HTML document that contains the three components above (A, B, and C) to Moodle by the due date.


II. Revised Project and Exploratory Work--Due Friday, September 18, 5 p.m.


Revise your initial project design, incorporating the concepts and ideas covered in class to this point that apply to your planned data set.

Explore the data set through simple statistics and visualizations.

Identify the data cleaning and transformations you may need to apply.


Each of these components must start on separate pages in an electronic document (Word).

A. Revised/updated dataset narrative and audience descriptions

See parts A and B from the project proposal above

B. Data sources revised/updated

See part C from the project proposal above.

C. Data preprocessing (new section)

You should complete or describe much of the data preprocessing for this submission of the project. Document your data preprocessing. Consult with the instructor for help on any of these parts before the due date.

  1. What merging of data collections is necessary?
  2. What data cleaning is necessary?
  3. What attributes might you eliminate? Why?
  4. What tools do you expect to use or have used to generate a final CSV file?
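
The preprocessing questions above might translate into a pandas script along these lines. This is only a sketch: the column names, the specific cleaning rules, and the preprocess helper are assumptions chosen for illustration, and your own data will call for different steps.

```python
import pandas as pd

# Made-up raw data with typical problems: a duplicate row, stray whitespace,
# inconsistent case, a non-numeric placeholder, and a constant column.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "category": [" A", "b", "b", "A ", None],
    "value": ["10.5", "n/a", "n/a", "7", "3.2"],
    "constant": ["x"] * 5,
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few common cleaning steps and return a new frame."""
    out = df.drop_duplicates().copy()
    # Normalize text labels so " A" and "A " count as the same category.
    out["category"] = out["category"].str.strip().str.upper()
    # Convert to numbers; unparseable entries such as "n/a" become NaN.
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    # Drop attributes that carry no information (a single distinct value).
    constant_cols = [c for c in out.columns if out[c].nunique(dropna=False) <= 1]
    out = out.drop(columns=constant_cols)
    # Drop rows that are missing the category label.
    out = out.dropna(subset=["category"])
    return out.reset_index(drop=True)


clean = preprocess(raw)
# With no path, to_csv returns the CSV text; pass a file name to write the final CSV file.
print(clean.to_csv(index=False))
```

Keeping every step in one function doubles as documentation of the preprocessing, since the script records exactly what was removed or changed and why.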

D. Simple, exploratory Visualizations and data summaries (new section)

Consult with the instructor for help on any of these parts before the due date.

  1. Scatterplots (scatterplot matrix of all attribute pairs)
  2. Other accessible visualizations and summaries via Python (Pandas), RapidMiner, Tableau, or Weka
  3. Include Python scripts or screenshots of the workflows.
  4. Note any preliminary observations
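
For the Python route, a minimal sketch of simple summaries and a scatterplot matrix is shown below. The function names, the output file name, and the sample data are assumptions for illustration. The plotting imports sit inside the plotting function so that the summaries still run where matplotlib is not installed.

```python
import pandas as pd


def summarize(df: pd.DataFrame):
    """Return (descriptive statistics, correlation matrix) for the numeric columns."""
    numeric = df.select_dtypes(include="number")
    return numeric.describe(), numeric.corr()


def plot_scatter_matrix(df: pd.DataFrame, path: str = "scatter_matrix.png") -> None:
    """Save a scatterplot matrix of all numeric attribute pairs (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    scatter_matrix(df.select_dtypes(include="number"), figsize=(8, 8), diagonal="hist")
    plt.savefig(path)
    plt.close("all")


# Made-up data, for illustration only.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
    "label": list("abcde"),
})

stats, corr = summarize(df)
print(stats)
print(corr.round(2))
# plot_scatter_matrix(df)  # uncomment to write scatter_matrix.png
```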


Upload the Word/PDF/HTML document into Moodle.



III. Midterm class presentation--Friday, October 2

Prepare a 5-7 minute presentation of your project to date to the class as a preliminary overview of your presentation at the end of class.

A. Project Description (<1 min)

Provide a basic abstract of your dataset and audience.

B. Data sources (<1 min)

Describe your data sources. Identify all the existing web sites, databases, or other sources for your body of data. Include their URLs and describe the data layouts.

C. Data preprocessing (1 min)

The data preprocessing for the project should be finished. Describe any interesting data preprocessing that you needed to do.

D. Data exploration/visualization (1 min)

Show what you found using data science tools from R, Python, and/or Weka in data exploration mode.

E. Preliminary data mining results (2 min)

IV. Milestone check--Due Sunday, November 1, 11:59 p.m.

Submit a progress report covering your work since the oral presentation. It should show the results of all the machine learning algorithms you have applied to your project.

Include a summary of observations of those results.

You do not need to resubmit your project description/proposal, unless there is an addendum of information.

Submit your Word/PDF/HTML document to Moodle.

V. Final Class Presentation--Friday/Monday, November 20/23

You are expected to give a presentation of about 7-8 minutes, with time for questions, comments, and suggestions. The presentation should be a coherent whole covering your project data set description, sample data, visualizations, and tentative data mining results.

A. Project Description and data sources (<1 min)

Review the basic abstract of your dataset and audience.

Describe your data sources. Identify all the existing web sites, databases, or other sources for your body of data. Include their URLs and describe the data layouts.

B. Data exploration/visualization (1 min)

Present any background necessary to help explain the results of the machine learning algorithms used.

C. Overview of your data mining results (5 min)

D. Q and A (1 min)


VI. Final Project Submission, Due Wednesday, December 2, 2020, 9 a.m.
(this is an absolute deadline)


Document your final project design, incorporating the issues and ideas of visualization and data mining from the class. Keep in mind that this project is in lieu of a final. You are expected to apply as much class material as you can in the project. The better project write-ups are 15-20 pages.

The project must involve a substantial quantity of data so that visualization and data mining techniques can be applied and the use of several methods is warranted.


Upload your document (Word, PDF, or HTML) and any other supporting files into Moodle.

A. Project Description

In a one-page abstract (no more), describe in general terms the domain of data that you have analyzed. Also describe the intent and use of your results. How will these results be useful?

B. Data sources

C. Application of machine learning algorithms