Model Matrices in R
Table of Contents
1 Accessing the model matrix
2 Some linear algebra functions applied to model matrices
2.1 Determinant
2.2 Condition numbers
2.3 Rank
3 Linear algebra theory vs. numerical linear algebra practice
3.1 Matrix decompositions
3.2 Use of decompositions for fitting linear models
4 Use of contrasts for categorical covariates
5 Projections
Linear models and generalized linear models incorporate a model matrix, often written as X, in the model for the distribution of the response, y, as it depends on covariates. This matrix is sometimes called a design matrix but we will distinguish between a model matrix and a design matrix.
When we use an R function such as lm, aov, or glm to fit a linear or a generalized linear model, the model matrix is created automatically from the formula and data arguments.
summary(fm1 <- lm(optden ~ carb, Formaldehyde))
Call:
lm(formula = optden ~ carb, data = Formaldehyde)

Residuals:
        1         2         3         4         5         6 
-0.006714  0.001029  0.002771  0.007143  0.007514 -0.011743 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.005086   0.007834   0.649    0.552    
carb        0.876286   0.013535  64.744 3.41e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.008649 on 4 degrees of freedom
Multiple R-squared: 0.999,     Adjusted R-squared: 0.9988 
F-statistic: 4192 on 1 and 4 DF,  p-value: 3.409e-07
model.matrix(fm1)
  (Intercept) carb
1           1  0.1
2           1  0.3
3           1  0.5
4           1  0.6
5           1  0.7
6           1  0.9
attr(,"assign")
[1] 0 1
We can also create a model matrix directly from the formula and data arguments if we wish to experiment with the representation of different models.
model.matrix(~ carb, Formaldehyde)
  (Intercept) carb
1           1  0.1
2           1  0.3
3           1  0.5
4           1  0.6
5           1  0.7
6           1  0.9
attr(,"assign")
[1] 0 1
(Note that we can use a one-sided formula because we don't need to know the response to be able to determine the model matrix. A two-sided formula also works as the left-hand side is ignored.)
We will denote the dimensions of X as n by p. That is, n is the number of observations and p is the number of coefficients in the model.
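For the model fit above, for example, n = 6 and p = 2:
dim(model.matrix(fm1))   # 6 rows (observations, n) and 2 columns (coefficients, p)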
2 Some linear algebra functions applied to model matrices
A common operation in linear models is to form the cross-product matrix, X'X, from X. It could be done as
X <- model.matrix(fm1)
t(X) %*% X
            (Intercept) carb
(Intercept)         6.0 3.10
carb                3.1 2.01
(the t function creates the transpose and the %*% function is matrix multiplication) but it is preferable to use crossprod
(XtX <- crossprod(X))
            (Intercept) carb
(Intercept)         6.0 3.10
carb                3.1 2.01
As we can see, X'X is symmetric. It defines a quadratic form (see section 1.5 of the text), which will be positive semi-definite (appendix A.3) because, for any p-dimensional vector b
- b'(X'X)b = (Xb)'(Xb) >= 0
since (Xb)'(Xb) is the squared length of Xb. When X has full column rank, meaning that the columns of X are linearly independent, then Xb is non-zero for any non-zero vector b, and X'X will be positive definite.
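As a small numerical check (the vectors b below are arbitrary; the seed is chosen only so the illustration is reproducible), the quadratic form is non-negative and equals the squared length of Xb:
set.seed(1)                          # arbitrary seed, for reproducibility only
for (i in 1:3) {
    b  <- rnorm(ncol(X))
    qf <- drop(t(b) %*% XtX %*% b)   # the quadratic form b'(X'X)b
    ss <- sum((X %*% b)^2)           # (Xb)'(Xb), the squared length of Xb
    print(c(quadratic.form = qf, squared.length = ss))
}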
When X'X is positive definite its eigenvalues must be strictly positive. For small to moderate values of p it is reasonable to check this with the eigen function.
eigen(XtX)
$values
[1] 7.6914651 0.3185349

$vectors
           [,1]       [,2]
[1,] -0.8778294  0.4789735
[2,] -0.4789735 -0.8778294
As we see, the default for the eigen function is to produce both the eigenvalues and the matrix of eigenvectors. If we only want the eigenvalues it is slightly faster to use
eigen(XtX, symmetric=TRUE, only.values=TRUE)
$values
[1] 7.6914651 0.3185349

$vectors
NULL
although, for a 2 by 2 symmetric matrix, the difference in execution time will be minuscule.
It turns out that we don't need to form X'X explicitly to evaluate whether or not it is positive definite. The singular value decomposition (appendix A.12) of X, obtained with the svd function, provides the singular values of X, which are the square roots of the eigenvalues of X'X.
svd(X, nu=0, nv=0)
$d
[1] 2.773349 0.564389

$u
NULL

$v
NULL
(setting nu and nv to zero suppresses evaluation of the singular vectors, just like only.values in eigen).
As always happens with numerical operations, there is some round-off so we cannot expect exact equality of the eigenvalues and the squares of the singular values
sv <- svd(X, nu=0, nv=0)$d
ev <- eigen(XtX, symmetric=TRUE, only.values=TRUE)$values
all(sv^2 == ev)
[1] FALSE
Here we are checking for exact equality of the floating-point numbers in the two vectors, and that fails.
However, if we compare the difference between the vectors relative to the sizes of the vectors themselves
all.equal(sv^2, ev)
[1] TRUE
we find that they are effectively equal. With numeric vectors the all.equal function checks for "equality up to possible round-off error".
In general we do not expect numerical methods, including numerical linear algebra, to provide "exact" answers. We are using finite precision arithmetic and the answers we produce will be subject to round-off and other numerical errors.
2.1 Determinant
The determinant is a real-valued function of a square matrix that, in some sense, measures the size of the matrix. For the particular case of a model matrix, X, the determinant of X'X, written |X'X|, measures the volume of the parallelepiped spanned by the columns of X, which does have some applications in statistics.
There are two functions in R, det and determinant, to evaluate this
det(XtX)
[1] 2.45
(dd <- determinant(XtX))
$modulus
[1] 0.896088
attr(,"logarithm")
[1] TRUE

$sign
[1] 1

attr(,"class")
[1] "det"
Notice that, by default, the determinant function returns the logarithm of the modulus and, separately, the sign. We need to exponentiate the logarithm of the modulus to get back the determinant.
exp(dd$modulus)
[1] 2.45
attr(,"logarithm")
[1] TRUE
We'll state without proof that the determinant of a matrix is the product of its eigenvalues
prod(ev)
[1] 2.45
which brings us back to the singular values of X. We can calculate the determinant of X'X directly from the singular values of X as
prod(sv^2)
[1] 2.45
2.2 Condition numbers
In practice we prefer to work with decompositions of X directly rather than forming X'X and then decomposing that matrix. The condition number of a matrix is, roughly, the extent to which it introduces numerical instability in calculations. It is the ratio of the largest to the smallest singular value. A matrix is well-conditioned if its condition number is close to 1 and ill-conditioned if its condition number is very large.
The condition number of X'X is
ev[1]/ev[2]
[1] 24.14638
which could also be evaluated as
kappa(XtX, exact=TRUE)
[1] 24.14638
By contrast, the condition number of X is
sv[1]/sv[2]
[1] 4.913897
which is, of course, the square root of the condition number of X'X.
Theoretically, the condition number of a singular matrix should be infinity. In practice, it comes out to be a large number. To judge exactly what "large" means, it is helpful to consider the reciprocal condition number,
rcond(X)
[1] 0.1720052
We regard a matrix as being "numerically singular" when its reciprocal condition number is less than the relative machine precision, ε
.Machine$double.eps
[1] 2.220446e-16
In practice we may apply a looser criterion, say less than 100 ε.
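As a rough sketch of how an ill-conditioned model matrix arises (the covariate x2 and the data frame Form2 are hypothetical, constructed only for this illustration), consider a covariate that is nearly a copy of carb:
Form2 <- within(Formaldehyde, x2 <- carb + 1e-9 * seq_along(carb))  # x2 is not part of the real data
Xbad  <- model.matrix(~ carb + x2, Form2)
kappa(Xbad, exact = TRUE)   # a very large condition number
rcond(Xbad)                 # a correspondingly tiny reciprocal condition number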
kappa is a generic function in R and can be applied to a fitted model directly.
kappa(fm1, exact=TRUE)
[1] 4.913897
2.3 Rank
The rank of a matrix is the number of linearly independent columns (or, equivalently, rows) in the matrix. For model matrices we are interested in the column rank, which is the number of linearly independent columns. In particular, a model matrix has full column rank if its columns are all linearly independent.
In theory this is a well-defined property. In practice, it is not.
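One common way to assess the numerical column rank is from the pivoted QR decomposition, or by counting singular values above a tolerance; the tolerance used below is one common convention, not the only possible choice.
qr(X)$rank                  # numerical column rank from the pivoted QR decomposition
d <- svd(X, nu = 0, nv = 0)$d
sum(d > max(d) * max(dim(X)) * .Machine$double.eps)   # singular values above the tolerance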
3 Linear algebra theory vs. numerical linear algebra practice
In theory, theory and practice are the same. In practice, they're not.
Numerical linear algebra is fundamental to statistical computing and especially to those computations associated with linear models or generalized linear models.
Many of the recent developments in numerical linear algebra occurred after the development of the statistical theory, with the result that those who write about the theory as they learned it many years ago tend to use constructions that don't correspond to current knowledge.
3.1 Matrix decompositions
Numerical linear algebra is an area where the theory/practice divide is particularly jarring. Just about everything done in numerical linear algebra is based on matrix decompositions and using such decompositions to simplify expressions. Other than the spectral (or eigenvalue/eigenvector) decomposition of a square matrix, decompositions are rarely mentioned in a linear algebra course.
In statistics the most important decompositions apply to a model matrix, X, or to the cross-product (in the sense of X'X, R function crossprod) of a model matrix.
- Singular Value Decomposition: A = U D V' where U and V are orthogonal and D is zero off the main diagonal. (R function svd)
- QR Decomposition: X = Q R where Q is orthogonal and R is upper triangular. If X has dimension n by p with n >= p then one form has Q of dimension n by n, sometimes written as two adjacent vertical sections Q = [Q1 Q2], and R of size n by p. Often the R factor is written as a vertical concatenation of the p by p matrix R1 and a zero matrix, so that X = Q R = Q1 R1. (R function qr; a small check of this factorization follows the list.)
- Cholesky decomposition: A symmetric positive semidefinite matrix A can be factored as A = R'R = LL' where R is upper triangular and L is lower triangular. This is the same decomposition; it is just a matter of taste whether you regard the factor on the left, L, or the one on the right, R, as the defining factor. Statisticians tend to use R. If A = X'X then the R from the Cholesky decomposition of A is the same as R1 from the QR decomposition of X, up to changes in sign of rows of R. By convention, the diagonal elements of the Cholesky factor R are chosen to be non-negative. (R function chol)
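As a quick check of the QR entry above, the thin factors returned by qr.Q and qr.R reproduce the X from earlier:
qrX <- qr(X)
all.equal(X, qr.Q(qrX) %*% qr.R(qrX), check.attributes = FALSE)   # X = Q1 R1, up to round-off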
For general linear algebra applications the most important decomposition is
- LU Decomposition: Decompose a square matrix as A = LU where L is unit lower triangular and U is upper triangular. (R function Matrix::lu)
We also mentioned the
- Spectral decomposition: For a general square matrix the decomposition, when it exists, is A = V D V⁻¹ where D is diagonal and may have complex elements on the diagonal, and V need not be orthogonal. If A is symmetric then the eigenvalues (the diagonal elements of D) are known to be real and V can be taken to be orthogonal, so that we can write A = V D V'. Note that the spectral decomposition of X'X is related to the singular value decomposition (SVD) of X.
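A quick check of that relationship, using the X and X'X from earlier (the comparisons use absolute values because eigenvectors and singular vectors are determined only up to sign):
s <- svd(X)
e <- eigen(XtX, symmetric = TRUE)
all.equal(s$d^2, e$values)              # squared singular values = eigenvalues of X'X
all.equal(abs(s$v), abs(e$vectors))     # right singular vectors = eigenvectors, up to sign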
3.2 Use of decompositions for fitting linear models
Computational methods for fitting linear models could be based on the singular value decomposition but that is an iterative decomposition and could be expensive to evaluate for large models (i.e. both n and p are large).
The QR decomposition is a direct (i.e. non-iterative) decomposition.
The default methods for lm and glm use the QR decomposition, created by the qr function.
str(qr(X))
List of 4
 $ qr   : num [1:6, 1:2] -2.449 0.408 0.408 0.408 0.408 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "(Intercept)" "carb"
  ..- attr(*, "assign")= int [1:2] 0 1
 $ rank : int 2
 $ qraux: num [1:2] 1.41 1.15
 $ pivot: int [1:2] 1 2
 - attr(*, "class")= chr "qr"
This object is included in the fitted model
str(fm1$qr)
List of 5
 $ qr   : num [1:6, 1:2] -2.449 0.408 0.408 0.408 0.408 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "(Intercept)" "carb"
  ..- attr(*, "assign")= int [1:2] 0 1
 $ qraux: num [1:2] 1.41 1.15
 $ pivot: int [1:2] 1 2
 $ tol  : num 1e-07
 $ rank : int 2
 - attr(*, "class")= chr "qr"
The QR object itself is somewhat opaque. Information on the matrices Q and R is embedded in it but not in an obvious way. There are several functions used to extract information from this object or to apply operations based on it to another object (see ?qr for a list).
qr.R(fm1$qr)
  (Intercept)       carb
1   -2.449490 -1.2655697
2    0.000000  0.6390097
The matrix R1 from the QR decomposition is equivalent to R, the Cholesky factor of X'X, in the sense that both of them are upper triangular and R1'R1 = R'R. However, there may be differences in the signs of the rows.
chol(XtX)
            (Intercept)      carb
(Intercept)    2.449490 1.2655697
carb           0.000000 0.6390097
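We can confirm that the two factors agree apart from the signs of their rows:
all.equal(abs(qr.R(fm1$qr)), abs(chol(XtX)), check.attributes = FALSE)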
4 Use of contrasts for categorical covariates
We will see that categorical covariates are expanded into a set of indicator columns in a linear model. Often it is pointed out that a parameterization of an analysis of variance model, such as
- yij = μ + αi + εij
will produce a rank-deficient model matrix if a separate column for each coefficient is introduced.
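To see the rank deficiency concretely, here is a small illustration using the InsectSprays data that appears below; Xfull is constructed only for this example. With an intercept plus an indicator column for every level, the indicators sum to the intercept column:
Xfull <- cbind("(Intercept)" = 1, model.matrix(~ 0 + spray, InsectSprays))
c(ncol(Xfull), qr(Xfull)$rank)   # 7 columns but rank 6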
In R this rank deficiency is avoided by employing a set of contrasts for categorical covariates. By default the contrasts are contr.treatment for unordered factors and contr.poly for ordered factors
getOption("contrasts")
        unordered           ordered 
"contr.treatment"      "contr.poly" 
That is, a model with a categorical factor is expressed in terms of an intercept, μ + α1, and the differences α2 - α1, α3 - α1, …
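To see how the spray factor (used in the fit below) is expanded under the default treatment contrasts, we can look at the distinct rows of its model matrix:
unique(model.matrix(~ spray, InsectSprays))   # level A is absorbed into the intercept; the other levels get indicator columns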
summary(fm2 <- aov(sqrt(count) ~ spray, InsectSprays))
            Df Sum Sq Mean Sq F value    Pr(>F)    
spray        5 88.438 17.6876  44.799 < 2.2e-16 ***
Residuals   66 26.058  0.3948                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
coef(summary.lm(fm2))
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  3.7606784  0.1813877 20.7328227 1.315091e-30
sprayB       0.1159530  0.2565209  0.4520217 6.527355e-01
sprayC      -2.5158217  0.2565209 -9.8074726 1.637405e-14
sprayD      -1.5963245  0.2565209 -6.2229803 3.797346e-08
sprayE      -1.9512174  0.2565209 -7.6064654 1.340328e-10
sprayF       0.2579388  0.2565209  1.0055273 3.183154e-01
Under this parameterization the model matrix is not rank deficient.
kappa(fm2, exact=TRUE)
[1] 6.854102
You can reconstruct the μ, α1, … representation for simple models with categorical covariates with model.tables, but you must use aov to fit the model instead of lm.
model.tables(fm2)
Tables of effects

 spray 
spray
      A       B       C       D       E       F 
 0.9482  1.0642 -1.5676 -0.6481 -1.0030  1.2062 
model.tables(fm2, type="mean")
Tables of means
Grand mean
2.812433 

 spray 
spray
    A     B     C     D     E     F 
3.761 3.877 1.245 2.164 1.809 4.019 
5 Projections
Least squares estimation and hypothesis testing with the analysis of variance or t-tests can be viewed geometrically as orthogonal projection of the response vector onto the column span of X. In practice we don't form the projection matrices but for illustration we can.
In the QR decomposition the n by n Q matrix is orthogonal and its first p columns, written Q1, span the column space of X. The relationship Q'Q=I means that the columns of Q are orthonormal. That is, they are all orthogonal to each other and all have length 1. This will also hold for the columns of Q1, which are a subset of the columns of Q.
str(Q11 <- qr.Q(fm1$qr))
num [1:6, 1:2] -0.408 -0.408 -0.408 -0.408 -0.408 ...
crossprod(Q11)
             [,1]         [,2]
[1,] 1.000000e+00 1.197637e-16
[2,] 1.197637e-16 1.000000e+00
We see that the result, having been calculated in floating-point arithmetic, is not exact. Sometimes the zapsmall function helps to make the result clearer.
zapsmall(crossprod(Q11))
     [,1] [,2]
[1,]    1    0
[2,]    0    1
The same is true for model fm2
zapsmall(crossprod(Q12 <- qr.Q(fm2$qr)))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    0    0    0    0    0
[2,]    0    1    0    0    0    0
[3,]    0    0    1    0    0    0
[4,]    0    0    0    1    0    0
[5,]    0    0    0    0    1    0
[6,]    0    0    0    0    0    1
The matrix A = Q1 Q1' is a projection onto the column span of X. It is a projection because it is idempotent
- AA = (Q1 Q1')(Q1 Q1') = Q1 (Q1'Q1) Q1' = Q1 (I) Q1' = Q1 Q1' = A
and symmetric. Its column span, which is the range of the projection, is the same as the column span of X.
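For illustration (as noted above, we would not normally form the projection matrix explicitly), we can build A from Q1 for the first model and check these properties:
A <- tcrossprod(Q11)                    # A = Q1 Q1', a 6 by 6 projection matrix
all.equal(A %*% A, A)                   # idempotent, up to round-off
all.equal(drop(A %*% Formaldehyde$optden), unname(fitted(fm1)))   # projecting y gives the fitted values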
The "effects" component of an lm
or aov
fitted model is Q'y.
str(fm2$effects)
 Named num [1:72] -23.86 4.04 -5.25 -3.1 -5.88 ...
 - attr(*, "names")= chr [1:72] "(Intercept)" "sprayB" "sprayC" "sprayD" ...
head(fm2$effects)
(Intercept)      sprayB      sprayC      sprayD      sprayE      sprayF 
-23.8642861   4.0383487  -5.2468729  -3.0956950  -5.8836567   0.6318184 
all.equal(unname(fm2$effects), unname(qr.qty(fm2$qr, model.response(fm2$model))))
[1] TRUE
Interestingly, none of the formulas that you learned for analysis of variance tables are actually used. The analysis of variance table is constructed from the effects vector.
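For instance, the spray sum of squares in the table above is just the sum of squares of the corresponding elements of the effects vector:
sum(fm2$effects[2:6]^2)        # sum of squares for spray from the effects vector
anova(fm2)[["Sum Sq"]][1]      # the corresponding entry in the analysis of variance table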
The least squares estimates are formed by solving R1β = Q1' y for β.
backsolve(qr.R(fm2$qr), fm2$effects[1:6])
[1] 3.7606784 0.1159530 -2.5158217 -1.5963245 -1.9512174 0.2579388
all.equal(unname(coef(fm2)), backsolve(qr.R(fm2$qr), fm2$effects[1:6]))
[1] TRUE
Date: 2010-08-23 Mon