To introduce these ideas, we analyze the data from a student experiment to explore the nature of the relationship between a person's heart rate and the frequency at which that person stepped up and down on steps of various heights. The student experimenters tested the effects of two different step heights crossed with three different stepping speeds, for a total of six different treatment combinations. In addition, they used six different subject/experimenter combinations as blocks. They used a balanced incomplete block design, meaning that every pair of factor levels appears in an equal number of blocks. For this experiment, each of the six height/frequency combinations appears in five blocks, where a block is a student exerciser and a pair of experimenters.
% cp ~/larget/496/step.dat ~/496/step.dat % Splus -e > step.frame <- read.table("step.dat",header=T)Any variable that takes on character values will be interpreted as a factor, and any variable with only numerical values will be interpreted as quantitative by default. For the data we have read in, Time and Frequency should also be ordered factors. This S-PLUS code changes the status of both variables.
> step.frame$Frequency <- ordered(step.frame$Frequency, + levels = c("slow","moderate","fast")) > step.frame$Time <- ordered(step.frame$Time)For Frequency, you need to explicitly tell S-PLUS the proper order. For Time, S-PLUS uses the numerical order by default.
You can check that all variables are correctly classified by using the function sapply, which is similar to apply, and applies a specified function to all objects in a list, or in the case of data frames, to each column. The function is.factor returns T if a column is a factor or an ordered factor. The function is.ordered returns T if a column is an ordered factor.
> sapply(step.frame,is.factor) > sapply(step.frame,is.ordered)The response variable we are interested in is the difference between heart rate (HR) and resting heart rate (RestHR). Create a variable called DiffHR for this difference and add it to the data frame.
> attach(step.frame) > step.frame$DiffHR <- HR - RestHR > attach(step.frame)
> plot.factor(DiffHR ~ Block + Time + Height + Frequency)Each of these pictures gives an informal indication as to whether a given factor should be included in the model. Another useful summary plot is created with plot.design.
> plot.design(step.frame)
Put your data into a data frame with each column correctly specified as a quantitative variable, factor, or ordered factor. Section 10.1 in the textbook gives more details on the syntax for doing these tasks in S-PLUS.
Fit a model using the appropriate function. Analysis of variance models are fit with the S-PLUS function aov. This example continues the data introduced above. To fit a model that predicts the difference in heart rate based on the group of students (Block), number in sequence of a group of experiments (Time), height of the step (Height), and frequency with which the exerciser steps (Frequency),
> fit <- aov(DiffHR ~ Block + Time + Height + Frequency, data=step.frame)This creates an object named fit which contains all necessary information about the fit. The textbook gives more details on changing the formula statement to fit alternative models.
Use diagnostic tools to examine the quality of the fitted model. Other functions may then be used to extract information about the model. For example,
> summary(fit)produces the ANOVA table. If you wish to extract the fitted coefficients,
> coef(fit)does the job. If you want to work with the fitted values or residuals,
> fv <- fitted.values(fit) > r <- residuals(fit)will do the trick. Notice that the functions used above are identical to those used to examine the object created by fitting a regression model and, in fact, will work for the several other models available in S-PLUS.
The function plot applied to fit produces a few diagnostic plots. Again, you may exercise more control over what you wish to view by working with the residuals, fitted values, and data directly. For example,
> plot(fitted.values(fit),residuals(fit),xlab="Fitted Values", + ylab="Residuals") > abline(0,0)produces a plot of the residuals versus the fitted values.
Just as with regression, finding the appropriate model to describe the data requires fitting several models and examining the fits interactively. S-PLUS gives you the tools to do this. This course attempts to show you how to use these tools computationally. Courses in applied statistics are necessary for you to learn more about why these tools can be effective.
The book Modern Applied Statistics with S-Plus by Venables and Ripley and the book Statistical Models in S edited by Chambers and Hastie (of which I have given you an excerpt) include much more detailed information on practical model fitting using S-PLUS.
Bret Larget, larget@mathcs.duq.edu