STAT340 Lecture 14: Recap and Exam Review

In this, our final lecture of the semester, we will review the tools that we have discussed this semester and get a few glances at what is to come in your future courses.

Probability and Random Variables

We started our semester with a discussion of random variables and basic ideas from probability theory.

Some concepts that you should review:

What is a random variable?
The definition of independently and identically distributed (i.i.d.) data
What is a parameter of a model or distribution?
Densities and cumulative distribution functions
Generating common random variables (e.g., normal, Poisson, uniform, exponential, etc) in R using built-in functions (e.g., rnorm, rpois, etc.)
Generating random variables using the inverse trick: given a density or distribution function, write a function to generate copies of a random variable with that density or distribution function.

Monte Carlo

The workhorse of our whole semester was the Monte Carlo framework. At a high level, Monte Carlo is a way to estimate expectations by replacing an integral or sum with a mean.

Example: Suppose that we have picked up a bent coin from the ground and we wish to estimate the probability \(p\) that this coin lands heads. One way to estimate \(p\) is to flip the coin many times and count how often it lands heads.

Example: We are interested in the value of \(\mathbb{E} g(X)\), where -\(g\) is a function that we know how to compute, and -\(X\) is a random variable that we can generate

One way to compute \(\mathbb{E} f(X)\) would be to evaluate the integral \[ \int_{\mathcal{X}} g(t) f_X(t) dt, \] where \(f_X(t)\) is the density of the variable \(X\) and \(\mathcal{X}\) is the support of \(X\). If \(X\) is a discrete random variable, that integral gets replaced with a sum: \[ \sum_{t \in \mathcal{X}} g(t) \Pr[ X=t], \]

Monte Carlo avoids performing a complicated sum or integral and instead estimates \(\mathbb{E} g(X)\) by taking advantage of the intuitive definition of an expectation as a long-run average. We estimate the expectation as \[ \frac{1}{M} \sum_{i=1}^M g( X_i ), \]

where \(X_1,X_2,\dots,X_M\) are independent copies of our random variable \(X\).

You should be able to:

Explain the basic ideas underlying Monte Carlo
Use Monte Carlo to estimate the probability of an event
Use Monte Carlo to estimate an expectation

Hypothesis testing

Our first real statistical problem considered this semester was hypothesis testing: we assume a model for our data (called the null model), and we wish to assess how reasonable this model is as an explanation for our observed data.

Example: we started our discussion of hypothesis testing by talking about Muriel Bristol, a.k.a. “the lady tasting tea”. Bristol was asked to guess, from a collection of eight cups of tea, which four cups were prepared by pouring milk into the cup before the tea, and which had milk added to the tea afterwards. When subjected to this experiment, Bristol correctly identified which cups of tea were prepared milk-first and which were tea-first, and our question was whether Bristol was guessing at random or not. Our null hypothesis was that Bristol was guessing at random, and we wanted to assess how “surprising” it was that Bristol guessed as well as she did if it were true that she was guessing at random.

The above sketch gets at the core idea behind p-values: we want a number that describes how “surprised” we are by a particular experimental result, if the null hypothesis were true.

We saw two basic ways to perform hypothesis tests:

Permutation testing: our null hypothesis is that data is “at random” or has some particular independence structure (e.g., assigning people at random to receive a placebo or treatment).
Model-based tests: our null hypothesis is that our data came from a particular model with a particular parameter and assess how likely the data was under that model

Some things to review:

Definition of a p-value
Test statistics, their definition and relation to p-values
Type I and type II errors and trade-offs between them
Definition of the size of a test; testing at level \(\alpha\)
One- and two-sided testing, and when each is appropriate

Conditional Probability and Bayes’ Rule

An important idea in probability concerns conditional probability: the probability of some event given that some other event has occurred.

Example: the most common illustrative example of conditional probability concerns disease screening: we have a test for some disease, and a patient tests positive under that screening test. In light of this result, what is the probability that our patient has the disease? That is, what is \[ \Pr[ \text{disease} \mid \text{positive test}] \] We can compute this using Bayes’ rule: for events \(A\) and \(B\), with \(\Pr[ A ]\) and \(\Pr[B]\) both positive, \[ \Pr[ A \mid B ] = \frac{ \Pr[ B \mid A ] \Pr[ A ] }{ \Pr[ B ] }. \]

Some things to review:

Definition of conditional probability: \(\Pr[A \mid B] = \Pr[ A, B ] / \Pr[ B ]\)
Using Bayes’ rule to compute \(\Pr[A \mid B]\) for events \(A\) and \(B\)
Estimating conditional and joint probabilities from count data

Estimation and Confidence Intervals

After discussing hypothesis testing, we considered the problem of estimation, in which there is a quantity out there in nature that we are interested in “guessing”. Often, this quantity can be interpreted as the parameter of a model.

Example: Suppose we trying to estimate how many customers contact a call center during a given span of time (e.g., between 2pm and 5pm on a Thursday). A common model for this sort of problem is to model the number of customers as a Poisson random variable. The mean of this distribution is the rate parameter \(\lambda\), and so estimating the average number of customers to contact our call center is equivalent to estimating the parameter \(\lambda\).

Rather than just “guessing” a parameter or other quantity, we typically want to associate a measure of uncertainty with our estimate. In this class, we did that using a confidence interval (CI): a random (i.e., data-dependent) interval that has the guarantee that averaging over many data sets, some percentage of the time, the interval will contain the true value of the parameter.

We saw three basic ways to build confidence intervals this semester:

Simulation-based CIs: we estimate the parameters of our model, then generate “fake” data sets from our model, and apply our statistic of interest to that data. The resulting (approximate) Monte Carlo replicates of our statistic of interest are used to construct a CI
CLT-based CIs: we appeal to the central limit theorem, which says that the sample mean \(\bar{X} \sum_i X_i/n\) of a collection of i.i.d. random variables \(X_1,X_2,\dots,X_n\) is approximately normal with mean given by \(\mu=\mathbb{E} X_1\) and its expectation and variance given by \(\sigma^2/n\), where \(n\) is the sample size and \(\sigma^2 = \operatorname{Var} X_1\). We estimate \(\mu\) and \(\sigma^2\) from the data, usually via the sample mean and variance, respectively.
Bootstrap-based CIs: later in the semester, we saw a different approach to estimating the variance \(\sigma^2\), namely, using the bootstrap. We’ll discuss that more below.

Some other concepts to review:

Building a confidence interval: You should be able to build a \(100*(1-\alpha)\)-percent confidence interval for a given quantity (e.g., a mean or other parameter) at a given level \(\alpha\) using any of the above three methods.
Definition of a confidence interval. A \(P\)-percent CI is a random interval \(C= (L, U)\), which has the property that for a target parameter \(\theta\), \[ \Pr[ L \le \theta \le U ] = \frac{ P }{ 100 }. \]
“Duality” of confidence intervals and testing: any acceptance region for a level-\(\alpha\) test can be turned into a level-\(100*(1-\alpha)\) confidence interval and vice versa.

Linear Regression

The last third of our course was dedicated to the task of prediction, in which we aim to guess a label or “response” associated to one or more “predictor” variables.

Example: Our initial motivating example was (fictional) data in which we had collected, for each of a number of survey participants, two pieces of data: 1) how many years of education they had, 2) their yearly salary. A natural goal is to try to predict, from the number of years of education, that person’s income.

The simplest model for settings such as the one above is simple linear regression, in which we model our response (a.k.a. dependent variable) \(Y\) as a linear function (okay, technically an affine function) of the predictor (a.k.a. independent variable) \(X\): \[ Y = \beta_0 + \beta_1 X + \epsilon, \] where \(\epsilon\) is a mean-zero normal random variable with variance \(\sigma^2\).

Given a collection of observations \((X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\), our goal is to

Estimate \(\beta_0\) and \(\beta_1\)
Use those estimates to predict the responses for other values of the predictors

We then discussed the case of multiple linear regression, in which we have more than one predictor variable for each of our observations.

Example: we might have a number of different pieces of information about a house (e.g., square footage, age of the house, school district, proximity to a park, etc), and our goal might be to predict the price of that house.

Some things to review:

Least squares estimation in simple linear regression
Definition of the residual sum of squares
Using the lm function to fit a linear model (either simple or multiple linear regression)
Interpretation of estimated coefficients, their p-value and confidence intervals
Interpretation/definition of interaction terms in linear models and how to fit a model with an interaction term

Logistic Regression

One problem with linear regression is that it predicts that our response is… well, linear. That is, our predicted response is a linear function of the data. We saw that this is not so appropriate when our response is, for example, binary. In such situations, logistic regression is more appropriate.

Logistic regression models our observed response \(Y\) as a Bernoulli random variable, with success probability determined by a linear function of the predictors: \[ \Pr[ Y=1 ] = \sigma( \beta_0 + \beta_1 x_1 + beta_2 x_2 + \beta_p x_p ), \]

where \(\sigma( \cdot )\) is the logistic function, \[ \sigma( z ) = \frac{ 1 }{ 1 + \exp\{ -z \} }. \]

Some things to review:

Definition of logistic regression and why we might choose it over linear regression
Fitting logistic regression to a given data set using glm
Interpreting fitted coefficients, their confidence intervals and associated p-values
Maximum likelihood estimation of the coefficients and how it differs from least-squares estimation used in linear regression

Cross-Validation and model selection

One question that was left largely unaddressed in our first few weeks of discussion of regression and prediction was how we decided what variables to include in our model to begin with. This decision is the problem of variable selection, an example of the more general problem of model selection.

Example: recalling our housing example, there are many different pieces of information that we could collect about a house for use in predicting its price. If we try to use every available piece of information, then we are liable to over-fit our data (unless we have a truly huge number of observations). How do we decide which pieces of information to include in our model and which to ignore?

Answering the model selection problem requires that we be able to compare different models fitted to the same data. There are a number of different ways to do this, but we saw two basic tools: cross-validation and feature selection methods.

Cross-validation (CV)

Cross-validation is a general tool for comparing models. What we want is an estimate of how well our model would do if we used it to do prediction on previously unseen data. To do that, cross-validation trains our model on some of our data, and then evaluates its performance on the rest of our data, which was not used in our training of the model.

We discussed three different approaches to CV:

a “naive” approach that split the data in two
Leave-one-out cross validation (LOOCV), in which we iteratively took all but one data point, set that one data point aside trained a model on the \(n-1\) remaining points, and then evaluated our model’s performance in predicting the held-out data point
\(K\)-fold cross validation, in which we split our data into \(K\) “folds” and then iteratively set each fold aside, train a model on the remaining \(K-1\) folds of data, and evaluate our model’s performance on the held-out fold.

Some things to review:

You should be able to implement any of the three CV approaches outlined above (indeed, note that they can all be obtained by writing one function!)
Explain why \(K\)-fold CV is preferable to the “naive” half-and-half split and to leave-one-out CV
Use cross-validation to choose among a collection of models

Feature selection and regularization

Another set of tools for choosing among a class of regression models is feature selection. Rather than trying to use cross-validation, which requires setting aside data to perform model fitting, feature selection methods attempt to compare multiple model simultaneously.

You should be able to

Explain (but not necessarily implement) the basic ideas behind subset selection and step-wise feature selection (both forward and backward)
Explain the basic ideas behind ridge regression and the LASSO, including the role of the regularization parameter \(\lambda\)
Apply ridge regression and/or the LASSO to a given data set

The Bootstrap

Our last full week of lecture concerned the bootstrap, a method for estimating the variance of a statistic when simpler approaches (e.g., the sample variance or a plug-in approach) are not feasible.

The basic setup is that we have a parameter \(\theta\), and an estimator of that parameter \[ \hat{\theta} = \hat{\theta}(X_1,X_2,\dots,X_n). \]

We know that \(\hat{\theta}\) is approximately normal, with mean \(\theta\) and variance \(\nu^2/n\), so that if we knew \(\nu^2\), we could compute a confidence interval based \[ \hat{\theta} \sim N( \theta, \frac{\nu^2}{n}). \]

Unfortunately, in many situations, \(\nu\) is not easy to estimate. The bootstrap estiamtes this variance by sampling with replacement from the data sample \(X_1,X_2,\dots,X_n\) to obtain a bootstrap sample \(X_1^*, X_2^*, \dots, X_n^*\), on which we can compute our statistic to obtain a bootstrap replicate \[ \hat{\theta}^* = \hat{\theta}( X_1^*, X_2^*, \dots, X_n^* ). \]

Repeating this procedure multiple times to obtain replicates \(\hat{\theta}^*_1, \hat{\theta}^*_2, \dots, \hat{\theta}^*_B\), we can estimate \(\nu^2/n\) according to \[ \frac{1}{B-1} \sum_{b=1}^B \left( \hat{\theta}^*_b - \frac{1}{B} \sum_i \hat{\theta}^*_i \right)^2. \]

Some things to review:

You should be able to give a high-level explanation as to why the bootstrap works.
You should be able to apply the bootstrap to construct a confidence interval associated with a simple statistic (e.g., a sample mean)
You should be able to give an example of an estimation problem and associated statistic for which the bootstrap fails

STAT340 Lecture 14: Recap and Exam Review

Keith Levin

Fall 2022

Probability and Random Variables

Monte Carlo

Hypothesis testing

Conditional Probability and Bayes’ Rule

Estimation and Confidence Intervals

Linear Regression

Logistic Regression

Cross-Validation and model selection

Cross-validation (CV)

Feature selection and regularization

The Bootstrap