Math 225

Introduction to Biostatistics

Exploratory Data Analysis II: Quartiles, Boxplots, and Standard Deviation

Prerequisites

This lab assumes that you already know how to:

log in to the system computers;
use a browser to find the course Web page;
start the S-PLUS software;
move back and forth between the browser and S-PLUS;
use the Commands Window to find means and medians;
enter data into S-PLUS in several ways.

Technical Objectives

This lab will teach you to:

find quartiles;
construct boxplots;
construct side-by-side boxplots;
and calculate standard deviations.

Conceptual Objectives

In this lab you should learn to:

identify skewness from boxplots;
use side-by-side boxplots to make qualitative comparisons between the distributions of a quantitative variable measured on two or more groups;
and interpret the standard deviation.

In-class Activities

Finding quartiles.
The following data is from Exercise 2-4 on page 35 of the textbook and represents the time until death in hours for thirteen sheep that were fed a toxic weed as part of an experiment.
```
44 27 24 24 36 36 44 44 120 29 36 36 36
```
Create a variable named deathTime with this data using the scan function in the Commands Window. [How?]
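As a minimal sketch of this step: typing `deathTime <- scan()` in the Commands Window should prompt for values, which you enter separated by spaces and finish with a blank line. The block below is a non-interactive equivalent that builds the same variable with `c`, so it can be pasted in one piece. It was written with R-compatible syntax and is assumed, not verified, to behave the same way in your S-PLUS version.

```r
# Non-interactive equivalent of typing the thirteen values after deathTime <- scan()
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# Quick check that all thirteen observations were entered
length(deathTime)
```

If the length is not 13, a value was probably mistyped or skipped.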
Use the quantile function to find the five-number summary, including the lower and upper quartiles.
```
> quantile(deathTime)
   0%  25%  50%  75% 100% 
   24   29   36   44  120
```
Making boxplots.
A boxplot is a graphical display of the five-number summary (minimum, lower quartile, median, upper quartile, and maximum). S-PLUS draws boxplots vertically, so the quantitative variable is on the y-axis. (This is different from histograms, where the variable values are along the x-axis.) The middle half of the data is represented by a box, extending from the lower quartile to the upper quartile, split at the median. Simple boxplots have whiskers drawn from the top and bottom of the box to the maximum and minimum respectively.
Boxplots in S-PLUS also identify potential outliers. The interquartile range (IQR) is the distance between the upper and lower quartiles, seen graphically as the height of the box. Any individual observations that are more than 1.5 IQR units from the box are identified separately with a horizontal line. The whiskers extend to the maximal and minimal observations that are not potential outliers.
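The 1.5 IQR rule can be worked through by hand for the sheep data. The sketch below assumes the quartiles returned by `quantile` are the ones used for the box; the plotting routine may compute its box edges slightly differently for other data sets. The names `lowFence`, `highFence`, `outliers`, and `whiskers` are illustrative only.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

q <- quantile(deathTime)
lowerQ <- as.numeric(q[2])            # 25% point: 29
upperQ <- as.numeric(q[4])            # 75% point: 44
iqr <- upperQ - lowerQ                # height of the box: 15

# Observations beyond these limits are flagged as potential outliers
lowFence  <- lowerQ - 1.5 * iqr       # 6.5
highFence <- upperQ + 1.5 * iqr       # 66.5
outliers <- deathTime[deathTime < lowFence | deathTime > highFence]

# Whiskers stop at the most extreme observations that are not flagged
whiskers <- range(deathTime[deathTime >= lowFence & deathTime <= highFence])

outliers    # 120
whiskers    # 24 44
```

Here the upper whisker ends at 44, which is also the upper quartile, so the only sign of the long upper tail is the separately marked point at 120.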
Skewness is easily seen in a boxplot. If one whisker and box half are longer than the other, the distribution is skewed. If the lower values are more spread out than the upper values, we say the data is skewed to the left. If the upper values are more spread out than the lower values, we say the data is skewed to the right. A distribution that is symmetric will not be skewed.
Find the HARVEST data set on the course Web page and save it to the Desktop. Import the data into S-PLUS. [How?]
Attach the data frame so that we may refer to variables by name. [How?]
Display the variable HRCB in a boxplot by following these directions.
1. Just like with histograms, click on the "2D Plots" button.
2. Click on the boxplot button, which has a picture of a small vertical rectangle with small dots above and below.
3. Choose the harvest data set.
4. Because the variable will be drawn on the y-axis, click on the arrow to the right of "y Column(s)" and select the variable HRCB.
You should check that the boxplot agrees with the calculated quartiles and median.
```
> quantile(HRCB,na.rm=T)
```
Is this distribution skewed or fairly symmetric?
Making side-by-side boxplots.
Does smoking affect blood pressure in this data set? You can imagine sorting the individuals by their smoking status, calculating a five-number summary for each, and drawing separate boxplots on the same axis. To do this in S-PLUS, follow these steps.
1. Open the "Plots2D" palette and click on the boxplot button.
2. Select harvest as the data frame.
3. Select the categorical variable SMOKE in "x Column(s)".
4. Select the quantitative variable DBPCB in "y Column(s)".
5. Click on the OK button.
Does the graph show markedly different blood pressures depending on smoking status? Does dosage (number of cigarettes per day) seem to matter much?
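The idea behind a side-by-side boxplot (sort by group, summarize each group, draw the boxes on one axis) can also be sketched in the Commands Window. The values below are hypothetical and are NOT taken from the harvest data; `smoke` and `dbp` are made-up stand-ins for SMOKE and DBPCB. With the harvest data frame attached, the analogous call would presumably be `boxplot(split(DBPCB, SMOKE))`, but that is an assumption you should check in your own session.

```r
# Hypothetical values for illustration only; not from the harvest data set
smoke <- c("none", "none", "none", "light", "light", "light",
           "heavy", "heavy", "heavy")
dbp   <- c(70, 76, 82, 72, 78, 88, 74, 80, 90)

# Sort the measurements into one group per smoking category
groups <- split(dbp, smoke)

# Five-number summary for each group
summaries <- lapply(groups, quantile)
summaries

# One box per group, drawn on a common y-axis
boxplot(groups)
```

Passing a list to `boxplot` draws one box per list component, which is what makes the groups directly comparable.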
Finding the standard deviation.
The standard deviation is the square root of the variance. For fairly symmetric data, it may often be interpreted as a typical distance from the mean. For bell-shaped distributions, the following facts are usually approximately true.
1. About 68% of the observations are within one standard deviation from the mean.
2. About 95% of the observations are within two standard deviations from the mean.
3. Nearly all observations are within three standard deviations from the mean.
These approximations will not be very good for strongly skewed data. The standard deviation of a single variable may be found in the Commands Window. There is no built-in function for the standard deviation, but var finds the variance and sqrt finds the square root.
```
> sort(deathTime)
 [1]  24  24  27  29  36  36  36  36  36  44  44  44 120
> mean(deathTime)
[1] 41.23077
> sqrt(var(deathTime))
[1] 24.68182
```
Sorting the data and calculating the mean are not necessary for finding the standard deviation; they are shown here to help interpret it. Note that the single observation of 120 is an outlier and skews the distribution to the right. Only four observations are larger than the mean while nine are smaller, and the mean is larger than the median. In this case, the standard deviation is not a "typical" deviation. The outlier 120 is much farther from the mean than one standard deviation, while all other observations are substantially closer.
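To see how far the sheep data departs from the bell-shaped guidelines, the sketch below first rebuilds the standard deviation from its definition (sum of squared deviations divided by n - 1, then the square root) and then counts the fraction of observations within one, two, and three standard deviations of the mean. The name `fracWithin` is illustrative only.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

n <- length(deathTime)
m <- mean(deathTime)

# Standard deviation from its definition; agrees with sqrt(var(deathTime))
s <- sqrt(sum((deathTime - m)^2) / (n - 1))

# Fraction of observations within k standard deviations of the mean, k = 1, 2, 3
fracWithin <- sapply(1:3, function(k) mean(abs(deathTime - m) <= k * s))
names(fracWithin) <- c("1 SD", "2 SD", "3 SD")
fracWithin
```

All three fractions come out to 12/13 (about 0.92) rather than roughly 0.68, 0.95, and nearly 1, because every observation except 120 is within one standard deviation of the mean and 120 is more than three standard deviations away.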
You can plot a boxplot of this data from the Commands Window as well.
```
> boxplot(deathTime)
```
See the skewness in the plot and notice the outlier at 120.
Calculating summary statistics for a large data set.
You can compute a single report with summary statistics (means, medians, quartiles, standard deviation, etc.) for more than one variable in a data set by following these directions.
1. From the "Statistics" menu, select "Data Summaries" and then "Summary Statistics...".
2. Choose the data frame and click on or off any statistics you wish to see. You can do all variables simultaneously or one or several at a time. Click on an individual variable to do only one.
3. Clicking on a second variable while holding the shift key allows you to see summary statistics for all variables between the two. A report sheet opens with the summary statistics.
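A similar report can be sketched from the Commands Window. The example below is an assumption-laden illustration, not the method the menu uses: `toy` is a hypothetical two-variable data frame invented for this example, and `describe` is a made-up helper name. It applies the same functions used earlier in this lab (mean, var, sqrt, quantile) to every column at once.

```r
# Hypothetical data frame for illustration only; not one of the course data sets
toy <- data.frame(x = c(1, 2, 3, 4, 10), y = c(5, 5, 6, 7, 7))

# Illustrative helper: mean, standard deviation, and five-number summary of one variable
describe <- function(v) {
  q <- quantile(v, na.rm = TRUE)
  c(mean = mean(v, na.rm = TRUE),
    sd = sqrt(var(v, na.rm = TRUE)),
    min = q[[1]], Q1 = q[[2]], median = q[[3]], Q3 = q[[4]], max = q[[5]])
}

# One column of statistics per variable in the data frame
report <- sapply(toy, describe)
report
```

Setting `na.rm = TRUE` drops missing values before each statistic is computed, in the same way as the `quantile(HRCB,na.rm=T)` call earlier in the lab.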

Homework Assignment

Load the cereal data set into S-PLUS and answer the questions below. You should write your answers on this form and turn it in to your lab instructor by the due date.

Find a variable from the cereal data set that is skewed to the right. This can be the same variable you used for the previous assignment. Use S-PLUS to draw a boxplot of the variable and then sketch it on paper.
Make a box-and-whisker plot of the variable potass (mg potassium per serving). Describe how this plot indicates the presence of potential outliers. Identify the brands of cereal which are outliers. To do this, it may be helpful to display the potassium values and cereal names, sorted from largest potassium value to the smallest, in the Commands Window.
```
> attach(cereal)
> ord <- rev(order(potass))
> data.frame(name=name[ord],potass=potass[ord])
```
What characteristic of the cereal, apparent from the name, is common to the brands that are outliers?
Construct a side-by-side box-and-whisker plot of sugar versus shelf. Sketch this on your answer sheet.
Does the distribution of sugar content vary by grocery shelf? Give a hypothetical explanation for the pattern you see in the plot.

Last modified: January 4, 2001

Bret Larget, larget@mathcs.duq.edu