The Bootstrap

You are responsible for the material on the bootstrap that we discussed in class and that appears in the handout from the book An Introduction to the Bootstrap by Efron and Tibshirani. These notes clarify the main ideas.

The bootstrap is a general technique for assessing uncertainty in estimation procedures in which computer simulation through resampling data replaces mathematical analysis. We will focus on using the bootstrap to attach a standard error to an estimated parameter, although there are many other tasks the bootstrap can solve.

Suppose that you are interested in estimating the mean of an unknown population on the basis of randomly sampled data. The sample mean is the natural estimate, but we also wish to assess the amount of uncertainty in this estimate. A measure of this uncertainty is the standard error, the standard deviation of the sampling distribution of the sample mean. Theory tells us that the standard error of the sample mean equals the population standard deviation divided by the square root of the sample size. When the population standard deviation is unknown, we may use the sample standard deviation in its place. For normally shaped populations, or for large samples from nonnormal populations, we may also conclude that the sampling distribution is approximately normal in shape, which allows the computation of confidence intervals.
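
As a small illustration of the formula just described, the following sketch computes the standard error of a sample mean and an approximate 95% confidence interval. It is written in R, the open-source dialect of S; the calls should look the same in S-PLUS, though that is an assumption here. The data vector x is made up for the example, and the multiplier 1.96 assumes the normal approximation is adequate.

```r
# Hypothetical data, for illustration only
x <- c(12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7, 12.4, 10.1)

n <- length(x)
xbar <- mean(x)

# The sample standard deviation stands in for the unknown population value
se.mean <- sqrt(var(x)) / sqrt(n)

# Approximate 95% interval based on the normal approximation
ci <- c(xbar - 1.96 * se.mean, xbar + 1.96 * se.mean)
```

With a sample this small, a t multiplier such as qt(0.975, n - 1) would usually replace 1.96.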

Consider now the problem of estimating the population median. Again, the sample median is a natural estimate. Now, however, you may not be aware of a nice formula for finding the standard error. Simulation will allow us to estimate this number without great mathematical overhead.

In principle, the ideal way to estimate the standard error of the sample median would be to take a very large number of samples of the original size from the population, compute the sample median of each, and use the standard deviation of this large collection of simulated sample medians as an estimate of the true standard error. Unfortunately, we do not have the ability to sample repeatedly from the population. We can, however, sample repeatedly from our original sample, which is itself an estimate of the population. This is how the bootstrap works.
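
The ideal procedure described above can only be carried out when the population is known. The sketch below, in R, pretends that it is: the exponential population, the sample size of 25, and the 2000 repetitions are all assumptions chosen for illustration, not part of the method.

```r
set.seed(1)    # fix the random numbers so the run is repeatable
n <- 25        # size of each sample
nsim <- 2000   # number of samples drawn from the population

# Draw nsim samples of size n from the assumed population, one sample per row
samples <- matrix(rexp(n * nsim, rate = 1), nrow = nsim)

# Sample median of each row, then the standard deviation of those medians
medians <- apply(samples, 1, median)
se.ideal <- sqrt(var(medians))
```

The bootstrap replaces the draws from the population with draws from the one sample we actually have.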

1. Take the original sample of size n from the population of interest.
2. Compute the desired sample statistic (such as the median).
3. From the original sample, resample with replacement a bootstrap sample of size n. Some numbers in the original sample may be included several times in the bootstrap sample. Others may be excluded. This creates a bootstrap data set of the same size as the original.
4. Apply the estimation procedure to the bootstrap sample and store this value.
5. Repeat steps 3 and 4 B times and store all results. The estimated standard error is the standard deviation of the B separate estimates. For estimating a standard error, B = 200 is usually large enough.
6. If a histogram of the bootstrap estimates is approximately normal in shape, you may use normal theory to find confidence intervals for the unknown parameter. If the shape is not normal, the sampling distribution is probably not normal either, and more advanced techniques are needed to find a confidence interval. However, the bootstrap-generated standard error is still a reasonable measure of the variability in the estimation procedure.
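
The last step can be sketched as follows, in R. The original sample is simulated here purely for illustration, the earlier steps are compressed into a single replicate() call, and the 1.96 multiplier is only appropriate if the histogram looks roughly normal.

```r
set.seed(2)
x <- rnorm(30, mean = 50, sd = 10)   # hypothetical original sample
theta.hat <- median(x)               # statistic computed from the original sample
B <- 200

# B bootstrap medians, each from a resample of size n drawn with replacement
boot.medians <- replicate(B, median(sample(x, replace = TRUE)))
se.boot <- sqrt(var(boot.medians))

# Inspect the shape before trusting a normal-theory interval:
# hist(boot.medians)

# Approximate 95% normal-theory interval for the population median
ci <- c(theta.hat - 1.96 * se.boot, theta.hat + 1.96 * se.boot)
```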

The Bootstrap in S-PLUS

It is easy to implement the bootstrap in S-PLUS. Consider again finding a standard error for the sample median by the bootstrap from a sample stored in the array x. This piece of code resamples a bootstrap sample from x and computes the median.

> median.boot <- median(sample(x,replace=T))

Putting this into a loop allows you to implement the bootstrap.
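
One way such a loop might look is sketched below in R. The data values and the choice B = 200 are assumptions made for the example.

```r
set.seed(3)
# Hypothetical data, for illustration only
x <- c(3.2, 5.1, 4.8, 6.0, 2.9, 5.5, 4.1, 7.3, 3.8, 5.0, 4.4, 6.2)

B <- 200
median.boot <- numeric(B)   # storage for the B bootstrap medians

for (i in 1:B) {
  # Resample length(x) values with replacement and record the median
  median.boot[i] <- median(sample(x, replace = TRUE))
}

# Bootstrap standard error of the sample median
se.median <- sqrt(var(median.boot))
```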

Here is a more efficient piece of code from page 399 of An Introduction to the Bootstrap which implements the bootstrap for a general single sample problem.

> bootstrap <- function(x,nboot,theta,...){
+ data <- matrix(sample(x,size=length(x)*nboot,replace=T),nrow=nboot)
+ return(apply(data,1,theta,...))
+ }

To use this code to return the bootstrap sample medians from 200 bootstrap samples from array x:

> out <- bootstrap(x,200,median)

If S-PLUS has no built-in function for the statistic you wish to compute, you need to write an S-PLUS function that computes the estimate you want from the sampled data.
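
The sketch below, in R, shows how the results might be turned into standard errors and how a user-written statistic can be passed in. The bootstrap function is repeated so the example runs on its own. The data, the function cv, and the choice of a 10% trimmed mean are hypothetical and serve only as illustrations.

```r
# The bootstrap function from the text, repeated so the example runs on its own
bootstrap <- function(x, nboot, theta, ...) {
  data <- matrix(sample(x, size = length(x) * nboot, replace = TRUE), nrow = nboot)
  apply(data, 1, theta, ...)
}

set.seed(4)
x <- rexp(40) + 1   # hypothetical positive-valued sample

# Standard error of the median: standard deviation of the bootstrap medians
out <- bootstrap(x, 200, median)
se.median <- sqrt(var(out))

# A user-written statistic (hypothetical): the coefficient of variation
cv <- function(y) sqrt(var(y)) / mean(y)
se.cv <- sqrt(var(bootstrap(x, 200, cv)))

# Extra arguments are passed through "..." to theta, here a 10% trimmed mean
se.trim <- sqrt(var(bootstrap(x, 200, mean, trim = 0.1)))
```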

Last modified: May 2, 1997

Bret Larget, larget@mathcs.duq.edu