# Linear Models in SAS

## (Regression & Analysis of Variance)

The main workhorse for regression is proc reg, and for (balanced) analysis of variance, proc anova. The general linear model proc glm can combine features of both. Further, one can use proc glm for analysis of variance when the design is not balanced. Computationally, reg and anova are cheaper, but this is only a concern if the model has 50 or more degrees of freedom.

### Regression

Here are simple uses of proc reg for standard problems:
```
proc reg;		/* simple linear regression */
model y = x;

proc reg;		/* weighted linear regression */
model y = x;
weight w;

proc reg;		/* multiple regression */
model y = x1 x2 x3;
```
The model phrase indicates which variables are response (y) and which are predictors (x, or x1,x2,x3). Here are some print options for the model phrase:
```
model y = x / noint;		/* regression with no intercept */
model y = x / ss1;		/* print type I sums of squares */
model y = x / p;		/* print predicted values and residuals */
model y = x / r;		/* option p plus residual diagnostics */
model y = x / clm;		/* option p plus 95% CI for estimated mean */
model y = x / cli;		/* option p plus 95% CI for predicted value */
model y = x / r cli clm;	/* options can be combined */
```
CAUTION: SAS listings label the standard error of the estimated mean as the STD ERROR PREDICT. Be wary and know what these things mean! Some of the residual diagnostics go beyond the material covered here. You may explore these on your own.
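
To make the distinction concrete, here are the standard textbook formulas for unweighted simple linear regression. The notation is ours, not SAS's: $s^2$ is the residual mean square on $n-2$ degrees of freedom, $S_{xx} = \sum_i (x_i - \bar{x})^2$, and $\hat{y}_0 = b_0 + b_1 x_0$ is the fitted value at $x_0$.

$$
\mathrm{se}(\text{estimated mean at } x_0) = s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},
\qquad
\mathrm{se}(\text{predicted value at } x_0) = s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
$$

The 95% interval from clm is $\hat{y}_0 \pm t_{0.975,\,n-2}$ times the first quantity; the interval from cli uses the second, which is always wider because it includes the variation of a single new observation.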

It is possible to let SAS do the predicting of new observations and/or estimating of mean responses. The way to do this is to enter the x values (or x1,x2,x3 for multiple regression) you are interested in during the data input step, but put a period (.) for the unknown y value. That is,

```
data new;
input x y;
cards;
1 0
2 3
3 .
4 3
5 6
;
proc reg;
model y = x / r cli clm;	/* y is the response; its missing value at x=3 is predicted */
```
Try it, and check standard errors and confidence intervals by hand. Here are some other model options for more advanced stuff:
```
model y = x / covb;	/* covariance matrix for estimates */
model y = x / collin;	/* collinearity diagnostic */
model y = x / collinoint;	/* collin without intercept */
```
The output phrase can have several keywords (which can be used together):
```
output out=b predicted=py;	/* predicted values in "py" */
output out=b p=py;		/* same as predicted */
output out=b residual=ry;	/* residual values in "ry" */
output out=b r=ry;		/* same as residual */
output out=b stdr=sr;	/* standard error of residuals "sr" */
output out=b student=sy;	/* studentized residuals "sy" */
```
Only one output phrase can be used, but you can combine keywords on one line:
```
output out=b p=py r=ry stdr=sr student=sy;
```
Those new variables created in set b are available for later plotting, etc.
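
As a minimal sketch of that later use, the step below saves fitted values and residuals and then plots one against the other. The input data set name `a` is a hypothetical placeholder; `py` and `ry` are the names chosen in the output phrase above.

```sas
proc reg data=a;              /* "a" is a hypothetical input data set */
  model y = x;
  output out=b p=py r=ry;     /* fitted values and residuals go to set b */
run;
quit;

proc plot data=b;             /* character plot: residuals vs. fitted values */
  plot ry*py / vref=0;        /* reference line at residual = 0 */
run;
```

A patternless band around the reference line is what one hopes to see; curvature or a funnel shape suggests the model needs another look.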

### Analysis of Variance

Experiments involving a single factor or several factors with no missing data (balanced designs) can use the quick and easy proc anova to analyze the variation explained by those factors (analysis of variance, or ANOVA). More complicated ANOVA designs can be done PROVIDED the data are balanced. However, designs with imbalance among two or more factors should use proc glm.
```
proc anova;			/* one-way analysis of variance */
class trt;
model y = trt;

proc anova;			/* 1-way with multiple comparisons */
class trt;
model y = trt;
means trt / lsd snk;		/* LSD and Student-Newman-Keuls */

proc anova;			/* two-way anova */
class fert var;
model y = fert var;
means fert var / lsd;	/* means by fert and var with LSD */

proc anova;			/* two-way anova with interaction */
class fert var;
model y = fert var fert*var;	/* interaction signified by asterisk */
means fert var / lsd;
means fert*var;		/* for each fert-var combination */
```
The class phrase is required, identifying all factors as categorical variables. The model phrase has only a few options, and these are not often used. The means phrase is quite handy to do multiple comparisons. Options include:
```
means trt / t;		/* Least Significant Difference */
means trt / lsd;		/* Least Significant Difference */
means trt / bon;		/* Bonferroni */
means trt / snk;		/* Student-Newman-Keuls */
means trt / lsd alpha=.05;	/* LSD at level 5% (default) */
means trt / lsd lines;	/* force ordering of means */
means trt / lsd cldiff;	/* force pairwise tests of means */
```
The lines option to means is the default when data are balanced. The cldiff option can be useful at times, but it gives confidence intervals only for the pairwise differences, not for the means themselves. None of these options works when looking at 2-way combinations such as means fert*var;.
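
Here is a small self-contained run that could be used to try these options. The data set `yields` and its numbers are made up purely for illustration (three treatments, three replicates each, so the design is balanced).

```sas
data yields;                  /* made-up balanced data, for illustration only */
  input trt $ y @@;           /* @@ reads several observations per line */
  cards;
A 12 A 14 A 13
B 15 B 17 B 16
C 11 C 10 C 12
;

proc anova data=yields;
  class trt;
  model y = trt;
  means trt / lsd;            /* lines display (default for balanced data) */
  means trt / lsd cldiff;     /* pairwise differences with confidence intervals */
run;
quit;
```

Comparing the two means listings side by side shows how the same LSD comparisons are presented under lines and under cldiff.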

If you want to save predicted values or residuals, or to evaluate contrasts, you must use proc glm instead of proc anova. See below.

### Analysis of Covariance (ANCOVA)

Analysis of Covariance, or ANCOVA, combines features of ANOVA and regression. It lets you examine treatment differences adjusted for a covariate and, conversely, determine whether there is a significant relationship between X and Y after adjusting for treatment. ANCOVA is typically done using proc glm.
```
proc glm;			/* analysis of covariance */
class trt;			/* trt = factor, x = covariate */
model y = x trt;

proc glm;			/* analysis of covariance */
class trt;			/* with different slopes */
model y = x trt x*trt;
```
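
In model terms, the two fits above can be written as follows. The symbols are our own notation: $y_{ij}$ is observation $j$ in treatment $i$, $x_{ij}$ its covariate value, $\tau_i$ the treatment effect, and $\varepsilon_{ij}$ the error.

$$
\text{common slope:}\quad y_{ij} = \mu + \tau_i + \beta\,x_{ij} + \varepsilon_{ij}
$$

$$
\text{different slopes:}\quad y_{ij} = \mu + \tau_i + (\beta + \gamma_i)\,x_{ij} + \varepsilon_{ij}
$$

The test for the x*trt term in the second model is a test that all $\gamma_i = 0$, that is, that the treatments share one slope. If it is not significant, the simpler common-slope model is usually the one to interpret.
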
More advanced use of ANCOVA can be found in the section on Multiple Responses.

### General Linear Models (GLM)

The general linear models (GLM) procedure works much like proc reg except that we can combine regressor type variables with categorical (class) factors. The organization of the printout is slightly different from reg and anova, and some model and output options are different. Further, if you want model parameter estimates, it is best to explicitly request the solution option in the model phrase.
```
proc glm;		/* simple linear regression */
model y = x / solution;

proc glm;		/* weighted linear regression */
model y = x / solution;
weight w;

proc glm;		/* multiple regression */
model y = x1 x2 x3 / solution;

proc glm;		/* one-way analysis of variance */
class trt;
model y = trt;

proc glm;		/* additive two-factor anova */
class fert var;
model y = fert var;

proc glm;		/* full two-factor anova */
class fert var;
model y = fert | var;

proc glm;		/* analysis of covariance */
class trt;		/* trt = factor, x = covariate */
model y = x trt;

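data testlin; set resps;
x = level;		/* numeric copy of level, used as the regressor */
proc glm;		/* test for non-linearity */
class level;
model resp = x level;	/* Type I SS for level after x measures lack of fit */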
```
The class phrase works like in proc anova. However, here we can have both categorical (identified in class) and continuous variables in the model. The model phrase indicates which variables are response (y) and which are predictors (x, or x1,x2,x3). You won't get parameter estimates (solution) if there is a class phrase unless you ask for them. Here are some options:
```
model y = trt x / solution;	/* print parameter estimates and SEs */
model y = x / noint;		/* no intercept (as in proc reg) */
model y = x / ss1;		/* print only type I sums of squares */
model y = x / ss2;		/* print only type II sums of squares */
model y = x / p;		/* print predicted values and residuals */
model y = x / clm;		/* option p plus 95% CI for estimated mean */
model y = x / cli;		/* option p plus 95% CI for predicted value */
model y = x / cli alpha=.01;	/* only .01, .05 and .10 available */
```
The default way of estimating model parameters in SAS is to set the last group estimate to 0. Thus if there are 3 treatment groups, the estimated mean for group 1 is the intercept plus the estimate for trt=1; for group 2 it is similar; for group 3, the estimated mean for group 3 is the intercept since the estimate for trt=3 is 0. This can be changed by another option.
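
Written out for the 3-group case, with $\hat{\beta}_0$ for the intercept and $\hat{\tau}_i$ for the trt estimates (our notation, not the labels on the SAS listing):

$$
\hat{\mu}_1 = \hat{\beta}_0 + \hat{\tau}_1, \qquad
\hat{\mu}_2 = \hat{\beta}_0 + \hat{\tau}_2, \qquad
\hat{\mu}_3 = \hat{\beta}_0 \quad (\hat{\tau}_3 = 0)
$$

Equivalently, $\hat{\tau}_i = \hat{\mu}_i - \hat{\mu}_3$: each printed estimate is a difference from the last group. With made-up numbers, an intercept of 10 and estimates of 2 and $-1$ for trt=1 and trt=2 would correspond to group means of 12, 9 and 10.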

The means phrase works much the same in proc glm as in proc anova. Contrasts can be set up if means aren't enough. Here is an example from the glue data. The contrast phrase contains a quoted title, variable name and the contrast coefficient values. Note that the order of factor levels is lexicographic, which may not be what you expect. This can be checked by examining the order under the solution option to the model phrase. Further, these can get very complicated for higher order designs. Consult a book for further help.

```
contrast 'A vs. rest' glue 1 -.25 -.25 -.25 -.25;
contrast 'BD vs. CE' glue 0 .5 -.5 .5 -.5;
```
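
For context, a sketch of where the contrast phrase sits in a full proc glm step is given below. The data set name `gluedata` and the response name `strength` are hypothetical placeholders, since the glue data are not shown here. The estimate phrase is a companion to contrast that prints the estimated value and its standard error rather than only the F test.

```sas
proc glm data=gluedata;        /* hypothetical data set name */
  class glue;                  /* five glue types, assumed ordered A B C D E */
  model strength = glue / solution;   /* solution shows the level order */
  contrast 'A vs. rest' glue 1 -.25 -.25 -.25 -.25;
  contrast 'BD vs. CE'  glue 0 .5 -.5 .5 -.5;
  estimate 'A vs. rest' glue 4 -1 -1 -1 -1 / divisor=4;  /* same comparison */
run;
quit;
```

The divisor= option lets the coefficients be written as whole numbers, which avoids rounding trouble with fractions such as 1/3.
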
Predicted and residual (and other) values can be passed to other procedures and data steps using the output phrase in the same manner as proc reg.

### Sums of Squares Types: I, II, III & IV

Issues of the choice of Sums of Squares arise with unbalanced designs including two or more factors or covariates. proc reg by default uses Type II sums of squares, while proc glm gives you Types I and III by default. There are four types of sums of squares (SS) available. Slight imbalance leads to slight differences in these SS, and hence in their tests. However, more severe imbalance can lead to fundamentally different conclusions (for rather different hypotheses)! Below are some notes; for more detail see the books by Littell, Freund & Spector and/or Milliken & Johnson.

#### Short Summary of Types of Sums of Squares

To examine these SS and associated tests, we must consider three experimental design scenarios (here a "cell" is a treatment combination):
1. balanced data (each cell has exactly r replicates)
2. unbalanced data but each cell has at least one observation
3. unbalanced data with one or more empty cells

WARNING: If you have (3) any empty cells, extreme care is needed to interpret ANY of the types -- get some help!!

All four types give the same results for (1) (this is the same analysis as "proc anova"). For (2) Types III and IV test the balanced hypotheses, while the hypotheses for Types I and II depend on the number of observations per cell. Here are the SS for a 2-factor design:
```
Source   Type I SS        Type II SS       Type III or IV SS
A        SS(A|u)          SS(A|u,B)        SS(A|u,B,A*B)
B        SS(B|u,A)        SS(B|u,A)        SS(B|u,A,A*B)
A*B      SS(A*B|u,A,B)    SS(A*B|u,A,B)    SS(A*B|u,A,B)
```

- Type I (sequential): incremental improvement in the error SS as each effect is added to the model
- Type II (hierarchical): reduction in error SS due to adding the term to the model after all other terms except those that contain it
- Type III (orthogonal): reduction in error SS due to adding the term after all other terms have been added to the model
- Type IV (balanced): variation explained by balanced comparison of averages of cell means

Type I SS have certain distinctive properties. Only for Type I do the hypotheses depend on the order in which effects are specified, and only for Type I do the SS for all the effects sum to the model SS. Type I SS for polynomial models correspond to tests of orthogonal polynomials. Type I SS are preferable when some factors (such as blocking) should be taken out before other factors, even in an unbalanced design.
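
The order dependence of Type I SS is easy to see by fitting the same model twice with the main effects swapped. This is only a sketch; the data set name `unbal` is a hypothetical placeholder for any unbalanced two-factor data set.

```sas
proc glm data=unbal;            /* hypothetical unbalanced data set */
  class a b;
  model y = a b a*b / ss1;      /* A first: SS(A|u), then SS(B|u,A) */
run;
quit;

proc glm data=unbal;
  class a b;
  model y = b a a*b / ss1;      /* B first: SS(B|u), then SS(A|u,B) */
run;
quit;
```

With balanced data the two listings agree; with unbalanced data the main-effect lines generally differ, while their total (the model SS) stays the same.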

Type II approach is appropriate for model building, and is the natural choice for regression.

Type III and Type IV tests differ only if the design has empty cells. SAS automatically gives you Types I and III with proc glm. You can explicitly choose types with options to the model phrase:

```
proc glm;
class a b;
model y = a b a*b / ss1 ss2 ss3 ss4;  /* select all 4 types */
```

#### Hypotheses for Unbalanced Data

With no empty cells, appropriate tests can be performed with Type III SS:

- Types I/II: Hypotheses are functions of the cell counts (they differ from those tested if the data were balanced). This is usually undesirable. Type I hypotheses depend on the order of terms in the model.
- Types III/IV: Hypotheses are the same for balanced and unbalanced data, involving simple, marginal averages of (population) cell means.

With one or more empty cells, main effects may not be what you think they are. Some marginal means are not defined. It is not generally obvious how to compare main effects.

- Types I/II: Caution! Remember that hypotheses depend on cell counts.
- Type III: Hypotheses do not depend on the order of effects or on the labels of levels. However, the orthogonal contrasts used are difficult to interpret unless you are willing to assume some interactions are zero.
- Type IV: Hypotheses are balanced and easily interpretable. However, the SS may change if the labels of the factor levels are changed! Thus the exact tests performed depend on the order and labels of factor levels! Essentially, Type IV contrasts correspond to analysing subsets of factor levels chosen automatically.

What to do with empty cells? There is no easy answer to that one. My usual suggestion is something like this:
1. analyse with all data (use Type IV automated hypotheses)
2. analyse with all data (using Type II) for additive model
3. analyse combination of factors with missing cells as a single factor
4. pick subset(s) with no empty cells and analyse them
5. compare the subset analyses with the analysis in (1)
6. if the results are consistent, write them up
7. if they are not consistent--dig and find out why! (get help!)
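
Step 3 in the list above can be sketched as follows: build one label per observed factor combination and analyse it as a one-way layout. The names `mydata`, `combo` and `ab` are hypothetical; `a`, `b` and `y` stand for your own two factors and response.

```sas
data combo;                     /* "mydata" is a hypothetical input data set */
  set mydata;
  ab = catx('_', a, b);         /* one label per observed a-b combination */
run;

proc glm data=combo;
  class ab;
  model y = ab;                 /* one-way analysis over the non-empty cells */
  means ab / lsd;
run;
quit;
```

Because only cells that actually occur get a label, the empty cells simply drop out, and specific comparisons among the remaining cells can then be set up with contrast phrases.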

#### General Form of Estimable Functions

This is an advanced topic. The listings of estimable functions in SAS are rather confusing. It is strongly recommended that you read section 4.3.7 of the SAS System for Linear Models book by Littell, Freund & Spector. The general form of estimable functions can be obtained with options to the model phrase. Here are two examples:
```
proc glm;
class a b;
model y = a | b / e;		/* general form of estimable functions */

proc glm;
class a b;			/* estimable function coefficients */
model y = a | b / e1 e2 e3;	/* for Types I, II, III */
```

### Stepwise Regression

There is a stepwise model selection regression method. It works something like doing a series of proc regs, but the computer automatically makes the model choices of entry and elimination. Watch out! Be sure you know what this is doing for you (and to you).
```
proc stepwise;
model y = x1 x2 x3;
```
Here are model options for the means of selection and elimination:
```
model y = x1 x2 x3 / forward;	/* forward selection */
model y = x1 x2 x3 / backward;	/* backward elimination */
model y = x1 x2 x3 / stepwise;	/* forward in & backward out */
model y = x1 x2 x3 / maxr stop=4;	/* like stepwise, but using R^2 */
```
The cheapest methods are backward (or b) and forward (or f). The stepwise option (the default) is not much more costly, and a good idea in practice, as it checks back and forth. The maxr option is much more expensive, but does consider pairs of variables in ways possibly missed by stepwise; the stop=4 option to maxr only considers models with 4 or fewer variables, at considerable time savings. There is an alternative to maxr (called minr) which is even more costly.
```
model y = x1 x2 x3 / noint;	     /* no intercept */
model y = x1 x2 x3 / slentry=0.5; /* signif. level for selection */
model y = x1 x2 x3 / slstay=0.1;  /* signif. level for elimination */
model y = x1 x2 x3 / include=2;   /* force in first 2 variables */
model y = x1 x2 x3 / start=2;     /* start with 2 variables */
model y = x1 x2 x3 / details;     /* more details of R^2, F stats */
```
The significance levels slentry (or sle) and slstay (or sls) shown are the default ones (but sl=0.15 is used for selection and elimination with the stepwise option). The include option is useful if you want to force certain variables to always be in the model. The start option indicates how many must be in the model before elimination is considered (stepwise and maxr only).
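
If proc stepwise is not available in your SAS release, the same selection methods can be requested through the selection= option on the model phrase of proc reg. The sketch below assumes a data set named `a` (a hypothetical placeholder); check your local documentation for the options your release supports.

```sas
proc reg data=a;                /* "a" is a hypothetical data set */
  model y = x1 x2 x3 / selection=stepwise slentry=0.15 slstay=0.15;
run;
quit;
```

Other values such as selection=forward, selection=backward and selection=maxr correspond to the options described above.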