Data Preprocessing Exercises

DS 352 Syllabus

last updated 28-Aug-2020

Introduction

Adapt the tutorial4.ipynb exercise to carry out a few preprocessing steps on this university salary and faculty size dataset: aaup.csv

Here is a terse description of the variables in the dataset. You can use these variable names or create your own.

Univ_id: id number
Univ_name: Name of institution
State: 2 letter state code
Type: Type of institution (I, IIA, or IIB)
fp_sal: Average salary - full professors
ac_sal: Average salary - associate professors
at_sal: Average salary - assistant professors
to_sal: Average salary - all ranks
fp_com: Average compensation - full professors    
ac_com: Average compensation - associate professors
at_com: Average compensation - assistant professors
to_com: Average compensation - all ranks          
fp_#: Number of full professors    
ac_#: Number of associate professors
at_#: Number of assistant professors
in_#: Number of instructors        
to_#: Number of faculty - all ranks
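
As a minimal sketch, the variable list above can be mapped onto DataFrame column names at load time. Several things here are assumptions rather than facts about aaup.csv: that the file is comma-separated, that it has no header row, and that "*" is the only missing-value marker. The "_n" names replace "_#" purely for convenience, and the two rows are made up so the example runs without the real file.

```python
import io

import pandas as pd

# Column names follow the variable list; "_n" replaces "_#" (a naming choice, not required)
columns = [
    "univ_id", "univ_name", "state", "type",
    "fp_sal", "ac_sal", "at_sal", "to_sal",
    "fp_com", "ac_com", "at_com", "to_com",
    "fp_n", "ac_n", "at_n", "in_n", "to_n",
]

# Made-up rows standing in for aaup.csv (illustration only, not real data)
sample_file = io.StringIO(
    "1,Alpha College,AK,IIB,454,382,362,382,567,485,471,487,6,11,9,4,32\n"
    "2,Beta University,AL,I,*,448,376,510,830,*,*,650,30,20,25,3,78\n"
)

# header=None assumes no header row; na_values="*" turns each "*" into NaN while reading
df = pd.read_csv(sample_file, header=None, names=columns, na_values="*")

print(df.shape)         # (number of observations, number of attributes)
print(df.isna().sum())  # count of missing values per attribute
```

With the real file, `sample_file` would be replaced by the path to aaup.csv. If the file turns out to have a header row, `header=None` and `names=columns` should be dropped or adjusted.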

Tasks

Replicate the preprocessing steps applied to the breast cancer example as guided below:

1. Input the data into a Pandas dataframe; create the data columns of your choice; print the number of observations and attributes.

2. Recode the missing values to NaN. This dataset uses * to mark missing values. Print the counts of missing values across the attributes.

3. The median is not a good choice for replacing missing values. Do you have any suggestions? What can you try? (put answer into a markdown box)

4. Explore for outliers.  Apply the boxplot display.   What does the Z-score indicate here? Are there any outliers? (put responses in markdown)

5. Are there any duplicate records?

6. Can you aggregate the institutions within each state using the grouping operation from Pandas? You should end up with ~50 observations. Which statistics are you aggregating on? (describe in a markdown box)

7. Explore some sampling from the original data set, not the aggregate. What did you find to be best? Why? (put answer in a markdown box)

8. Pick one salary column and one count column to discretize. Why did you choose your type of discretization? (put answer in a markdown box)
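
The Pandas calls behind tasks 4 through 8 can be sketched as follows. This is an illustration under stated assumptions, not a model solution: the small DataFrame is made up, the column names `state`, `to_sal`, and `to_n` are hypothetical choices, and the Z-score cutoff of 2 is used only because a sample of eight rows cannot produce a sample Z-score as large as 3.

```python
import pandas as pd

# Made-up values for illustration only; not taken from aaup.csv
df = pd.DataFrame({
    "state": ["AL", "AL", "AK", "AK", "AZ", "AZ", "AZ", "AZ"],
    "to_sal": [380.0, 410.0, 455.0, 470.0, 400.0, 420.0, 415.0, 900.0],
    "to_n": [32, 120, 45, 60, 300, 80, 95, 150],
})

# Task 4: Z-score = distance from the column mean in units of standard deviation
z = (df["to_sal"] - df["to_sal"].mean()) / df["to_sal"].std()
outliers = df[z.abs() > 2]  # cutoff of 2 is an assumption for this tiny sample

# Task 5: count rows that exactly repeat an earlier row
n_duplicates = df.duplicated().sum()

# Task 6: one row per state, with a named statistic for each aggregated column
by_state = df.groupby("state").agg(
    mean_sal=("to_sal", "mean"),
    total_faculty=("to_n", "sum"),
)

# Task 7: simple random sample of half the rows, without replacement
sample = df.sample(frac=0.5, random_state=1)

# Task 8: equal-width bins for salary, equal-frequency bins for the faculty count
sal_bins = pd.cut(df["to_sal"], bins=3)
count_bins = pd.qcut(df["to_n"], q=4)

print(outliers)
print(n_duplicates)
print(by_state)
print(sample)
print(sal_bins.value_counts(sort=False))
print(count_bins.value_counts(sort=False))
```

Note that averaging per-institution averages, as `mean_sal` does here, ignores institution size. Whether that is acceptable is part of what task 6 asks you to justify. Boxplots for task 4 are available through `df.boxplot()`, which requires matplotlib.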

 

Submission

Export your notebook results to an HTML file and upload that file as your solution in Moodle.