Topic 17: Standard Deviation, z-Score, and Normal Distributions

Two measures of spread we discussed in Topic 16 were range and interquartile range. While useful, range is susceptible to outliers, since it uses only the maximum and minimum values of the dataset. The interquartile range uses more information, since it depends on the middle 50% of the data, and it is resistant to outliers. But the variability of a dataset depends on all the data. In this topic we discuss a measure of spread that does depend on all the data: standard deviation.

Here are the steps to calculate the standard deviation:
1. Compute the mean.
2. For each data value, calculate its deviation from the mean. The deviation from the mean is determined by subtracting the mean from the data value. Note that if we added all these deviations from the mean for one dataset, the sum would be 0 (or close, depending on round-off error).
3. Square each deviation from the mean.
4. Sum the squares of the deviations.
5. Divide the sum in #4 by (n - 1), where n is the number of data values. Note that the text says, "there are important statistical reasons we divide by one less than the number of data values."
6. Take the square root of the value in #5, which will give the standard deviation.
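
The six steps above can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name sample_std_dev and the small example list are hypothetical and are not taken from the text.

```python
import math

def sample_std_dev(values):
    """Sample standard deviation, following the six steps listed above."""
    n = len(values)
    mean = sum(values) / n                    # step 1: compute the mean
    deviations = [x - mean for x in values]   # step 2: deviation from the mean
    squares = [d ** 2 for d in deviations]    # step 3: square each deviation
    total = sum(squares)                      # step 4: sum the squares
    variance = total / (n - 1)                # step 5: divide by (n - 1)
    return math.sqrt(variance)                # step 6: take the square root

example = [2, 4, 4, 6]  # hypothetical data, chosen only for illustration
result = sample_std_dev(example)
print(round(result, 2))  # prints 1.63
```

For the example list, the mean is 4, the deviations are -2, 0, 0, and 2 (which sum to 0, as noted in step 2), the squares sum to 8, and 8/3 is about 2.67, whose square root is about 1.63.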

Now, let's look at an example where standard deviation helps explain the data.

Consider the following three datasets:
(1) 5, 25, 25, 25, 25, 25, 45
(2) 5, 15, 20, 25, 30, 35, 45
(3) 5, 5, 5, 25, 45, 45, 45

The mean, median, and range are all the same for these datasets, but the variability of each dataset is quite different.
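
The claim that the three datasets share the same mean, median, and range can be checked with a short Python sketch. It uses the standard library statistics module; the variable names are assumptions made for illustration.

```python
import statistics

# The three datasets listed above, keyed by their labels.
datasets = {
    1: [5, 25, 25, 25, 25, 25, 45],
    2: [5, 15, 20, 25, 30, 35, 45],
    3: [5, 5, 5, 25, 45, 45, 45],
}

summaries = {}
for label, values in datasets.items():
    mean = statistics.mean(values)
    median = statistics.median(values)
    data_range = max(values) - min(values)
    summaries[label] = (mean, median, data_range)
    print(f"dataset ({label}): mean={mean}, median={median}, range={data_range}")
```

Each dataset gives a mean of 25, a median of 25, and a range of 40, so none of these three summaries distinguishes the datasets.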

Let's calculate the standard deviation for each dataset:

Dataset (1): mean = 25

  i    xi    xi - mean    (xi - mean)^2
  1     5      -20            400
  2    25        0              0
  3    25        0              0
  4    25        0              0
  5    25        0              0
  6    25        0              0
  7    45       20            400
                     sum = 800
                     sum/(n-1) = 133.33
                     std. dev. = 11.55

Dataset (2): mean = 25

  i    xi    xi - mean    (xi - mean)^2
  1     5      -20            400
  2    15      -10            100
  3    20       -5             25
  4    25        0              0
  5    30        5             25
  6    35       10            100
  7    45       20            400
                     sum = 1050
                     sum/(n-1) = 175
                     std. dev. = 13.23

Dataset (3): mean = 25

  i    xi    xi - mean    (xi - mean)^2
  1     5      -20            400
  2     5      -20            400
  3     5      -20            400
  4    25        0              0
  5    45       20            400
  6    45       20            400
  7    45       20            400
                     sum = 2400
                     sum/(n-1) = 400
                     std. dev. = 20

The standard deviations for the three datasets are 11.55, 13.23, and 20, respectively. In general, a larger standard deviation indicates greater variability in the data, and a smaller standard deviation indicates less variability.
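
These values can be reproduced with Python's standard library, as a quick check rather than a replacement for the hand calculation. The function statistics.stdev divides by (n - 1), matching step 5; statistics.pstdev divides by n instead and would give smaller values. The variable names are assumptions made for illustration.

```python
import statistics

datasets = [
    [5, 25, 25, 25, 25, 25, 45],   # dataset (1)
    [5, 15, 20, 25, 30, 35, 45],   # dataset (2)
    [5, 5, 5, 25, 45, 45, 45],     # dataset (3)
]

# statistics.stdev is the sample standard deviation (divides by n - 1).
std_devs = [round(statistics.stdev(values), 2) for values in datasets]
print(std_devs)  # prints [11.55, 13.23, 20.0]
```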

One other point regarding averages and measures of spread: the median, five-number summary, range, and interquartile range are useful for describing a skewed dataset, while the mean and the standard deviation are more often used on fairly symmetric data without outliers.

One example of a distribution that is "roughly symmetric without outliers" is a normal distribution. Normally distributed data has a single peak, and roughly follows a bell shape. In fact, sometimes it is called a bell curve.

All normal curves start concave-up and switch to concave-down one standard deviation below the mean. The peak of the curve occurs above the value on the horizontal axis that corresponds to the mean. After the peak, the curve changes concavity again, back to concave-up, one standard deviation above the mean. A normal (or nearly normal) dataset is also symmetric with respect to the mean.

The graph above shows two normal curves with the same mean. The curve with the higher peak has the smaller standard deviation: the data values whose histogram gives rise to the taller, narrower curve are less spread out along the horizontal axis than the values that lead to the shorter, wider curve. Can you estimate the standard deviation of each dataset? (Look for where the change in concavity occurs.)
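
The change in concavity at one standard deviation from the mean can also be checked numerically. The sketch below is only an illustration: it uses the standard formula for a normal curve, estimates the second derivative with a finite difference, and assumes a hypothetical mean of 25 and standard deviation of 10 (these are not the curves in the graph).

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal curve with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def second_derivative(f, x, h=1e-2):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

mean, sd = 25.0, 10.0  # hypothetical values, for illustration only

def curve(x):
    return normal_pdf(x, mean, sd)

shapes = {}
for k in (-1.5, -0.5, 0.0, 0.5, 1.5):
    x = mean + k * sd
    # Positive second derivative means concave-up; negative means concave-down.
    shapes[k] = "concave-up" if second_derivative(curve, x) > 0 else "concave-down"
    print(f"{k:+.1f} standard deviations from the mean: {shapes[k]}")
```

The curve is concave-up at 1.5 standard deviations on either side of the mean and concave-down at 0.5 standard deviations and at the mean itself, so the switch happens between those points, consistent with the description above.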

There is a useful empirical rule (that is, a rule derived from practical experience) that applies to any dataset that follows an approximately normal distribution. It says that if the observations in a dataset can be approximated by a normal curve, then approximately 68% of the data values are within one standard deviation of the mean, approximately 95% are within two standard deviations of the mean, and approximately 99.7% are within three standard deviations of the mean.
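
For an exactly normal distribution, these percentages can be computed rather than memorized. The sketch below relies on the standard identity that the fraction of a normal distribution within k standard deviations of the mean equals erf(k / sqrt(2)); the function name fraction_within is an assumption made for illustration.

```python
import math

def fraction_within(k):
    """Fraction of an exactly normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

percentages = {k: 100 * fraction_within(k) for k in (1, 2, 3)}
for k, pct in percentages.items():
    print(f"within {k} standard deviation(s): {pct:.2f}%")
```

The computed values are close to 68%, 95%, and 99.7%. Real data that is only approximately normal will match these percentages only approximately.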

Sometimes, we want to compare values from different datasets. One way to do this is to standardize the values, by calculating how many standard deviations each observation lies from the mean. We can then compare these standardized values, which are called z-scores, and assess which observation is most extreme. This is one method we could use to answer the question about exams, which was posed at the beginning of the Topic 16 notes.

To demonstrate, we calculate the z-scores for datasets (2) and (3) from our original example. Both have a mean of 25; their standard deviations are 13.23 and 20, respectively.

Dataset (2): mean = 25, std. dev. = 13.23

  i    xi    z-score = (xi - mean) / std. dev.
  1     5    (5 - 25)/13.23  = -1.51
  2    15    (15 - 25)/13.23 = -0.76
  3    20    (20 - 25)/13.23 = -0.38
  4    25    (25 - 25)/13.23 =  0
  5    30    (30 - 25)/13.23 =  0.38
  6    35    (35 - 25)/13.23 =  0.76
  7    45    (45 - 25)/13.23 =  1.51

Dataset (3): mean = 25, std. dev. = 20

  i    xi    z-score = (xi - mean) / std. dev.
  1     5    (5 - 25)/20  = -1.00
  2     5    (5 - 25)/20  = -1.00
  3     5    (5 - 25)/20  = -1.00
  4    25    (25 - 25)/20 =  0
  5    45    (45 - 25)/20 =  1.00
  6    45    (45 - 25)/20 =  1.00
  7    45    (45 - 25)/20 =  1.00
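
The z-score calculation can be sketched in Python as follows. The function name z_scores is an assumption made for illustration, and the sketch uses the unrounded sample standard deviation from statistics.stdev, so a result may occasionally differ in the last digit from a hand calculation that starts from a rounded standard deviation.

```python
import statistics

def z_scores(values):
    """z-score of each value: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # divides by (n - 1)
    return [round((x - mean) / sd, 2) for x in values]

dataset_2 = [5, 15, 20, 25, 30, 35, 45]
dataset_3 = [5, 5, 5, 25, 45, 45, 45]

z2 = z_scores(dataset_2)
z3 = z_scores(dataset_3)
print(z2)
print(z3)
```

The value 45 is the maximum of both datasets, but it lies about 1.5 standard deviations above the mean in dataset (2) and only 1 standard deviation above the mean in dataset (3), so it is the more extreme observation in dataset (2).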