Model Matrices in R
Table of Contents
1 Accessing the model matrix
2 Some linear algebra functions applied to model matrices
2.1 Determinant
2.2 Condition numbers
2.3 Rank
3 Linear algebra theory vs. numerical linear algebra practice
3.1 Matrix decompositions
3.2 Use of decompositions for fitting linear models
4 Use of contrasts for categorical covariates
5 Projections
Linear models and generalized linear models incorporate a model matrix, often written as X, in the model for the distribution of the response, y, as it depends on covariates. This matrix is sometimes called a design matrix but we will distinguish between a model matrix and a design matrix.
When we use an R function such as lm, aov, or glm to fit a linear or a generalized linear model, the model matrix is created automatically from the formula and data arguments.
summary(fm1 <- lm(optden ~ carb, Formaldehyde))
Call:
lm(formula = optden ~ carb, data = Formaldehyde)

Residuals:
        1         2         3         4         5         6 
-0.006714  0.001029  0.002771  0.007143  0.007514 -0.011743 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.005086   0.007834   0.649    0.552    
carb        0.876286   0.013535  64.744 3.41e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.008649 on 4 degrees of freedom
Multiple R-squared: 0.999,     Adjusted R-squared: 0.9988 
F-statistic: 4192 on 1 and 4 DF,  p-value: 3.409e-07
model.matrix(fm1)
  (Intercept) carb
1           1  0.1
2           1  0.3
3           1  0.5
4           1  0.6
5           1  0.7
6           1  0.9
attr(,"assign")
[1] 0 1
We can also create a model matrix directly from the formula and data arguments if we wish to experiment with the representation of different models.
model.matrix(~ carb, Formaldehyde)
  (Intercept) carb
1           1  0.1
2           1  0.3
3           1  0.5
4           1  0.6
5           1  0.7
6           1  0.9
attr(,"assign")
[1] 0 1
(Note that we can use a one-sided formula because we don't need to know the response to be able to determine the model matrix. A two-sided formula also works as the left-hand side is ignored.)
We will denote the dimensions of X as n by p. That is, n is the number of observations and p is the number of coefficients in the model.
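For the model fit above, for example, n = 6 and p = 2:
dim(model.matrix(fm1))   # 6 rows (observations, n) and 2 columns (coefficients, p)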
2 Some linear algebra functions applied to model matrices
A common operation in linear models is to form the cross-product matrix, X'X, from X. It could be done as
X <- model.matrix(fm1)
t(X) %*% X
            (Intercept) carb
(Intercept)         6.0 3.10
carb                3.1 2.01
(the t function creates the transpose and the %*% function is matrix multiplication) but it is preferable to use crossprod
(XtX <- crossprod(X))
            (Intercept) carb
(Intercept)         6.0 3.10
carb                3.1 2.01
As we can see, X'X is symmetric. It defines a quadratic form (see section 1.5 of the text), which will be positive semi-definite (appendix A.3) because, for any p-dimensional vector b
- b'(X'X)b = (Xb)'(Xb) >= 0
since (Xb)'(Xb) is the squared length of Xb. When X has full column rank, meaning that the columns of X are linearly independent, then Xb is non-zero for any non-zero vector b, and X'X will be positive definite.
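As a small numerical check (the vectors b below are arbitrary; the seed is chosen only so the illustration is reproducible), the quadratic form is non-negative and equals the squared length of Xb:
set.seed(1)                          # arbitrary seed, for reproducibility only
for (i in 1:3) {
    b  <- rnorm(ncol(X))
    qf <- drop(t(b) %*% XtX %*% b)   # the quadratic form b'(X'X)b
    ss <- sum((X %*% b)^2)           # (Xb)'(Xb), the squared length of Xb
    print(c(quadratic.form = qf, squared.length = ss))
}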
When X'X is positive definite its eigenvalues must be strictly positive. For small to moderate values of p it is reasonable to check this with the eigen function.
eigen(XtX)
$values
[1] 7.6914651 0.3185349

$vectors
           [,1]       [,2]
[1,] -0.8778294  0.4789735
[2,] -0.4789735 -0.8778294
As we see, the default for the eigen function is to produce both the eigenvalues and the matrix of eigenvectors. If we only want the eigenvalues it is slightly faster to use
eigen(XtX, symmetric=TRUE, only.values=TRUE)
$values
[1] 7.6914651 0.3185349

$vectors
NULL
although, for a 2 by 2 symmetric matrix, the difference in execution time will be minuscule.
It turns out that we don't need to form X'X explicitly to evaluate whether or not it is positive definite. The singular value decomposition (appendix A.12) of X, obtained with the svd function, provides the singular values of X, which are the square roots of the eigenvalues of X'X.
svd(X, nu=0, nv=0)
$d
[1] 2.773349 0.564389

$u
NULL

$v
NULL
(setting nu and nv to zero suppresses evaluation of the singular vectors, just like only.values in eigen).
As always happens with numerical operations, there is some round-off so we cannot expect exact equality of the eigenvalues and the squares of the singular values
sv <- svd(X, nu=0, nv=0)$d
ev <- eigen(XtX, symmetric=TRUE, only.values=TRUE)$values
all(sv^2 == ev)
[1] FALSE
Here we are checking for exact equality of the floating-point numbers in the two vectors, and that fails.
However, if we compare the difference between the vectors relative to the sizes of the vectors themselves
all.equal(sv^2, ev)
[1] TRUE
we find that they are effectively equal. With numeric vectors the all.equal function checks for "equality up to possible round-off error".
In general we do not expect numerical methods, including numerical linear algebra, to provide "exact" answers. We are using finite precision arithmetic and the answers we produce will be subject to round-off and other numerical errors.
2.1 Determinant
The determinant is a real-valued function of a square matrix that, in some sense, measures the size of the matrix. For the particular case of a model matrix, X, the determinant of X'X, written |X'X|, measures the volume of the parallelepiped spanned by the columns of X, which does have some applications in statistics.
There are two functions in R, det and determinant, to evaluate this
det(XtX)
[1] 2.45
(dd <- determinant(XtX))
$modulus
[1] 0.896088
attr(,"logarithm")
[1] TRUE

$sign
[1] 1

attr(,"class")
[1] "det"
Notice that, by default, the determinant function returns the logarithm of the modulus and, separately, the sign. We need to exponentiate the logarithm of the modulus to get back the determinant.
exp(dd$modulus)
[1] 2.45
attr(,"logarithm")
[1] TRUE
We'll state without proof that the determinant of a matrix is the product of its eigenvalues
prod(ev)
[1] 2.45
which brings us back to the singular values of X. We can calculate the determinant of X'X directly from the singular values of X as
prod(sv^2)
[1] 2.45
2.2 Condition numbers
In practice we prefer to work with decompositions of X directly rather than forming X'X and then decomposing that matrix. The condition number of a matrix is, roughly, the extent to which it introduces numerical instability in calculations. It is the ratio of the largest to the smallest singular value. A matrix is well-conditioned if its condition number is close to 1 and ill-conditioned if its condition number is very large.
The condition number of X'X is
ev[1]/ev[2]
[1] 24.14638
which could also be evaluated as
kappa(XtX, exact=TRUE)
[1] 24.14638
By contrast, the condition number of X is
sv[1]/sv[2]
[1] 4.913897
which is, of course, the square root of the condition number of X'X.
Theoretically, the condition number of a singular matrix should be infinity. In practice, it comes out to be a large number. To judge exactly what "large" means, it is helpful to consider the reciprocal condition number,
rcond(X)
[1] 0.1720052
We regard a matrix as being "numerically singular" when its reciprocal condition number is less than the relative machine precision, ε
.Machine$double.eps
[1] 2.220446e-16
In practice we may apply a looser criterion, say less than 100 ε.
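As a rough sketch of how an ill-conditioned model matrix arises (the covariate x2 and the data frame Form2 are hypothetical, constructed only for this illustration), consider a covariate that is nearly a copy of carb:
Form2 <- within(Formaldehyde, x2 <- carb + 1e-9 * seq_along(carb))  # x2 is not part of the real data
Xbad  <- model.matrix(~ carb + x2, Form2)
kappa(Xbad, exact = TRUE)   # a very large condition number
rcond(Xbad)                 # a correspondingly tiny reciprocal condition number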
kappa is a generic function in R and can be applied to a fitted model directly.
kappa(fm1, exact=TRUE)
[1] 4.913897
2.3 Rank
The rank of a matrix is the number of linearly independent columns (or, equivalently, rows) in the matrix. For model matrices we are interested in the column rank, which is the number of linearly independent columns. In particular, a model matrix has full column rank if its columns are all linearly independent.
In theory this is a well-defined property. In practice, it is not.
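One common way to assess the numerical column rank is from the pivoted QR decomposition, or by counting singular values above a tolerance; the tolerance used below is one common convention, not the only possible choice.
qr(X)$rank                  # numerical column rank from the pivoted QR decomposition
d <- svd(X, nu = 0, nv = 0)$d
sum(d > max(d) * max(dim(X)) * .Machine$double.eps)   # singular values above the tolerance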
3 Linear algebra theory vs. numerical linear algebra practice
In theory, theory and practice are the same. In practice, they're not.
Numerical linear algebra is fundamental to statistical computing and especially to those computations associated with linear models or generalized linear models.
Many of the recent developments in numerical linear algebra occurred after the development of the statistical theory, with the result that those who write about the theory as they learned it many years ago tend to use constructions that don't correspond to current knowledge.
3.1 Matrix decompositions
Numerical linear algebra is an area where the theory/practice divide is particularly jarring. Just about everything done in numerical linear algebra is based on matrix decompositions and using such decompositions to simplify expressions. Other than the spectral (or eigenvalue/eigenvector) decomposition of a square matrix, decompositions are rarely mentioned in a linear algebra course.
In statistics the most important decompositions apply to a model matrix, X, or to the cross-product (in the sense of X'X, R function crossprod) of a model matrix.
- Singular Value Decomposition: A = U D V' where U and V are orthogonal and D is zero off the main diagonal. (R function svd)
- QR Decomposition: X = Q R where Q is orthogonal and R is upper triangular. If X has dimension n by p with n >= p then one form has Q of dimension n by n, sometimes written as two adjacent vertical sections Q = [Q1 Q2], and R of size n by p. Often the R factor is written as a vertical concatenation of the p by p matrix R1 and a zero matrix, so that X = Q R = Q1 R1. (R function qr; a small check of this factorization follows the list.)
- Cholesky decomposition: A symmetric positive semidefinite matrix A can be factored as A = R'R = LL' where R is upper triangular and L is lower triangular. This is the same decomposition; it is just a matter of taste whether you regard the factor on the left, L, or the one on the right, R, as the defining factor. Statisticians tend to use R. If A = X'X then the R from the Cholesky decomposition of A is the same as R1 from the QR decomposition of X, up to changes in sign of rows of R. By convention, the diagonal elements of the Cholesky factor R are chosen to be non-negative. (R function chol)
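As a quick check of the QR entry above, the thin factors returned by qr.Q and qr.R reproduce the X from earlier:
qrX <- qr(X)
all.equal(X, qr.Q(qrX) %*% qr.R(qrX), check.attributes = FALSE)   # X = Q1 R1, up to round-off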
For general linear algebra applications the most important decomposition is
- LU Decomposition: Decompose a square matrix as A = LU where L is unit lower triangular and U is upper triangular. (R function Matrix::lu)
We also mentioned the
- Spectral decomposition: For a general square matrix the decomposition, when it exists, is A = V D V⁻¹ where D is diagonal and may have complex elements on the diagonal, and V need not be orthogonal. If A is symmetric then the eigenvalues (the diagonal elements of D) are known to be real and V can be taken to be orthogonal, so that we can write A = V D V'. Note that the spectral decomposition of X'X is related to the singular value decomposition (SVD) of X.
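A quick check of that relationship, using the X and X'X from earlier (the comparisons use absolute values because eigenvectors and singular vectors are determined only up to sign):
s <- svd(X)
e <- eigen(XtX, symmetric = TRUE)
all.equal(s$d^2, e$values)              # squared singular values = eigenvalues of X'X
all.equal(abs(s$v), abs(e$vectors))     # right singular vectors = eigenvectors, up to sign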
3.2 Use of decompositions for fitting linear models
Computational methods for fitting linear models could be based on the singular value decomposition but that is an iterative decomposition and could be expensive to evaluate for large models (i.e. both n and p are large).
The QR decomposition is a direct (i.e. non-iterative) decomposition.
The default methods for lm and glm use the QR decomposition, created by the qr function.
str(qr(X))
List of 4
 $ qr   : num [1:6, 1:2] -2.449 0.408 0.408 0.408 0.408 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "(Intercept)" "carb"
  ..- attr(*, "assign")= int [1:2] 0 1
 $ rank : int 2
 $ qraux: num [1:2] 1.41 1.15
 $ pivot: int [1:2] 1 2
 - attr(*, "class")= chr "qr"
This object is included in the fitted model
str(fm1$qr)
List of 5
 $ qr   : num [1:6, 1:2] -2.449 0.408 0.408 0.408 0.408 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "(Intercept)" "carb"
  ..- attr(*, "assign")= int [1:2] 0 1
 $ qraux: num [1:2] 1.41 1.15
 $ pivot: int [1:2] 1 2
 $ tol  : num 1e-07
 $ rank : int 2
 - attr(*, "class")= chr "qr"
The QR object itself is somewhat opaque. Information on the matrices Q and R is embedded in it but not in an obvious way. There are several functions used to extract information from this object or to apply operations based on it to another object (see ?qr for a list).
qr.R(fm1$qr)
  (Intercept)       carb
1   -2.449490 -1.2655697
2    0.000000  0.6390097
The matrix R1 from the QR decomposition is equivalent to R, the Cholesky factor of X'X, in the sense that both of them are upper triangular and R1'R1 = R'R. However, there may be differences in the signs of the rows.
chol(XtX)
            (Intercept)      carb
(Intercept)    2.449490 1.2655697
carb           0.000000 0.6390097
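We can confirm that the two factors agree apart from the signs of their rows:
all.equal(abs(qr.R(fm1$qr)), abs(chol(XtX)), check.attributes = FALSE)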
4 Use of contrasts for categorical covariates
We will see that categorical covariates are expanded into a set of indicator columns in a linear model. Often it is pointed out that a parameterization of an analysis of variance model, such as
- yij = μ + αi + εij
will produce a rank-deficient model matrix if a separate column for each coefficient is introduced.
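To see the rank deficiency concretely, here is a small illustration using the InsectSprays data that appears below; Xfull is constructed only for this example. With an intercept plus an indicator column for every level, the indicators sum to the intercept column:
Xfull <- cbind("(Intercept)" = 1, model.matrix(~ 0 + spray, InsectSprays))
c(ncol(Xfull), qr(Xfull)$rank)   # 7 columns but rank 6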
In R this rank deficiency is avoided by employing a set of contrasts for categorical covariates. By default the contrasts are contr.treatment for unordered factors and contr.poly for ordered factors
getOption("contrasts")
        unordered           ordered 
"contr.treatment"      "contr.poly" 
That is, a model with a categorical factor is expressed in terms of an intercept, μ + α1, and the differences α2 - α1, α3 - α1, …
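To see how the spray factor (used in the fit below) is expanded under the default treatment contrasts, we can look at the distinct rows of its model matrix:
unique(model.matrix(~ spray, InsectSprays))   # level A is absorbed into the intercept; the other levels get indicator columns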
summary(fm2 <- aov(sqrt(count) ~ spray, InsectSprays))
            Df Sum Sq Mean Sq F value    Pr(>F)    
spray        5 88.438 17.6876  44.799 < 2.2e-16 ***
Residuals   66 26.058  0.3948                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
coef(summary.lm(fm2))
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  3.7606784  0.1813877 20.7328227 1.315091e-30
sprayB       0.1159530  0.2565209  0.4520217 6.527355e-01
sprayC      -2.5158217  0.2565209 -9.8074726 1.637405e-14
sprayD      -1.5963245  0.2565209 -6.2229803 3.797346e-08
sprayE      -1.9512174  0.2565209 -7.6064654 1.340328e-10
sprayF       0.2579388  0.2565209  1.0055273 3.183154e-01
Under this parameterization the model matrix is not rank deficient.
kappa(fm2, exact=TRUE)
[1] 6.854102
You can reconstruct the μ, α1, … representation for simple models with categorical covariates with model.tables, but you must use aov to fit the model instead of lm.
model.tables(fm2)
Tables of effects

 spray 
spray
      A       B       C       D       E       F 
 0.9482  1.0642 -1.5676 -0.6481 -1.0030  1.2062 
model.tables(fm2, type="mean")
Tables of means
Grand mean
2.812433 

 spray 
spray
    A     B     C     D     E     F 
3.761 3.877 1.245 2.164 1.809 4.019 
5 Projections
Least squares estimation and hypothesis testing with the analysis of variance or t-tests can be viewed geometrically as orthogonal projection of the response vector onto the column span of X. In practice we don't form the projection matrices but for illustration we can.
In the QR decomposition the n by n Q matrix is orthogonal and its first p columns, written Q1, span the column space of X. The relationship Q'Q=I means that the columns of Q are orthonormal. That is, they are all orthogonal to each other and all have length 1. This will also hold for the columns of Q1, which are a subset of the columns of Q.
str(Q11 <- qr.Q(fm1$qr))
num [1:6, 1:2] -0.408 -0.408 -0.408 -0.408 -0.408 ...
crossprod(Q11)
             [,1]         [,2]
[1,] 1.000000e+00 1.197637e-16
[2,] 1.197637e-16 1.000000e+00
We see that the result, having been calculated in floating-point arithmetic, is not exact. Sometimes the zapsmall function helps to make the result clearer.
zapsmall(crossprod(Q11))
     [,1] [,2]
[1,]    1    0
[2,]    0    1
The same is true for model fm2
zapsmall(crossprod(Q12 <- qr.Q(fm2$qr)))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    0    0    0    0    0
[2,]    0    1    0    0    0    0
[3,]    0    0    1    0    0    0
[4,]    0    0    0    1    0    0
[5,]    0    0    0    0    1    0
[6,]    0    0    0    0    0    1
The matrix A = Q1 Q1' is a projection onto the column span of X. It is a projection because it is idempotent
- AA = (Q1 Q1')(Q1 Q1') = Q1 (Q1'Q1) Q1' = Q1 (I) Q1' = Q1 Q1' = A
and symmetric. Its column span, which is the range of the projection, is the same as the column span of X.
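For illustration (as noted above, we would not normally form the projection matrix explicitly), we can build A from Q1 for the first model and check these properties:
A <- tcrossprod(Q11)                    # A = Q1 Q1', a 6 by 6 projection matrix
all.equal(A %*% A, A)                   # idempotent, up to round-off
all.equal(drop(A %*% Formaldehyde$optden), unname(fitted(fm1)))   # projecting y gives the fitted values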
The "effects" component of an lm
or aov
fitted model is Q'y.
str(fm2$effects)
 Named num [1:72] -23.86 4.04 -5.25 -3.1 -5.88 ...
 - attr(*, "names")= chr [1:72] "(Intercept)" "sprayB" "sprayC" "sprayD" ...
head(fm2$effects)
(Intercept)      sprayB      sprayC      sprayD      sprayE      sprayF 
-23.8642861   4.0383487  -5.2468729  -3.0956950  -5.8836567   0.6318184 
all.equal(unname(fm2$effects), unname(qr.qty(fm2$qr, model.response(fm2$model))))
[1] TRUE
Interestingly, none of the formulas that you learned for analysis of variance tables are actually used. The analysis of variance table is constructed from the effects vector.
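For instance, the spray sum of squares in the table above is just the sum of squares of the corresponding elements of the effects vector:
sum(fm2$effects[2:6]^2)        # sum of squares for spray from the effects vector
anova(fm2)[["Sum Sq"]][1]      # the corresponding entry in the analysis of variance table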
The least squares estimates are formed by solving R1β = Q1' y for β.
backsolve(qr.R(fm2$qr), fm2$effects[1:6])
[1] 3.7606784 0.1159530 -2.5158217 -1.5963245 -1.9512174 0.2579388
all.equal(unname(coef(fm2)), backsolve(qr.R(fm2$qr), fm2$effects[1:6]))
[1] TRUE
Date: 2010-08-23 Mon