In this, our final lecture of the semester, we will review the tools that we have discussed and take a few glances at what is to come in your future courses.

Probability and Random Variables

We started our semester with a discussion of random variables and basic ideas from probability theory.

Some concepts that you should review:

Monte Carlo

The workhorse of our whole semester was the Monte Carlo framework. At a high level, Monte Carlo is a way to estimate expectations by replacing an integral or sum with a mean.

Example: Suppose that we have picked up a bent coin from the ground and we wish to estimate the probability \(p\) that this coin lands heads. One way to estimate \(p\) is to flip the coin many times and count how often it lands heads.
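Here is a minimal Monte Carlo sketch of this in Python (the true bias p_true is of course hypothetical; in practice it is exactly the quantity we do not know):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

p_true = 0.6   # hypothetical bias of the bent coin (unknown in practice)
M = 10_000     # number of flips

flips = rng.random(M) < p_true   # each flip is heads with probability p_true
p_hat = flips.mean()             # the proportion of heads estimates p
print(p_hat)                     # close to 0.6 for large M
```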

Example: We are interested in the value of \(\mathbb{E} g(X)\), where

  • \(g\) is a function that we know how to compute, and
  • \(X\) is a random variable that we can generate.

One way to compute \(\mathbb{E} g(X)\) would be to evaluate the integral \[ \int_{\mathcal{X}} g(t) f_X(t) dt, \] where \(f_X(t)\) is the density of the variable \(X\) and \(\mathcal{X}\) is the support of \(X\). If \(X\) is a discrete random variable, that integral gets replaced with a sum: \[ \sum_{t \in \mathcal{X}} g(t) \Pr[ X=t]. \]

Monte Carlo avoids performing a complicated sum or integral and instead estimates \(\mathbb{E} g(X)\) by taking advantage of the intuitive definition of an expectation as a long-run average. We estimate the expectation as \[ \frac{1}{M} \sum_{i=1}^M g( X_i ), \]

where \(X_1,X_2,\dots,X_M\) are independent copies of our random variable \(X\).
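To make this concrete, here is a minimal sketch with a hypothetical choice of \(X \sim N(0,1)\) and \(g(x) = x^2\), for which \(\mathbb{E} g(X) = 1\) exactly, so we can check the answer:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def g(x):
    return x ** 2   # our (illustrative) function g

M = 100_000
X = rng.normal(loc=0.0, scale=1.0, size=M)   # M independent copies of X
estimate = g(X).mean()                       # (1/M) * sum_i g(X_i)
print(estimate)                              # approximately E g(X) = 1
```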

You should be able to:

Hypothesis testing

Our first real statistical problem considered this semester was hypothesis testing: we assume a model for our data (called the null model), and we wish to assess how reasonable this model is as an explanation for our observed data.

Example: we started our discussion of hypothesis testing by talking about Muriel Bristol, a.k.a. “the lady tasting tea”. Bristol was asked to guess, from a collection of eight cups of tea, which four cups were prepared by pouring milk into the cup before the tea, and which had milk added to the tea afterwards. When subjected to this experiment, Bristol correctly identified which cups of tea were prepared milk-first and which were tea-first, and our question was whether Bristol was guessing at random or not. Our null hypothesis was that Bristol was guessing at random, and we wanted to assess how “surprising” it was that Bristol guessed as well as she did if it were true that she was guessing at random.

The above sketch gets at the core idea behind p-values: we want a number that describes how “surprised” we are by a particular experimental result, if the null hypothesis were true.
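For the tea-tasting experiment, this p-value can be estimated with a small simulation. The sketch below assumes, arbitrarily, that cups 0 through 3 were the milk-first cups; under random guessing, the chance of identifying all four correctly is \(1/\binom{8}{4} = 1/70 \approx 0.014\), and the simulation should agree:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

milk_first = {0, 1, 2, 3}   # suppose cups 0-3 were prepared milk-first
M = 100_000

count = 0
for _ in range(M):
    # a guesser picks 4 of the 8 cups uniformly at random
    guess = set(rng.choice(8, size=4, replace=False))
    if guess == milk_first:
        count += 1

p_value = count / M   # estimated Pr[all four correct | random guessing]
print(p_value)        # close to 1/70, about 0.014
```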

We saw two basic ways to perform hypothesis tests:

Some things to review:

Conditional Probability and Bayes’ Rule

An important idea in probability concerns conditional probability: the probability of some event given that some other event has occurred.

Example: the most common illustrative example of conditional probability concerns disease screening: we have a test for some disease, and a patient tests positive under that screening test. In light of this result, what is the probability that our patient has the disease? That is, what is \[ \Pr[ \text{disease} \mid \text{positive test}]? \] We can compute this using Bayes’ rule: for events \(A\) and \(B\), with \(\Pr[ A ]\) and \(\Pr[B]\) both positive, \[ \Pr[ A \mid B ] = \frac{ \Pr[ B \mid A ] \Pr[ A ] }{ \Pr[ B ] }. \]
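As a worked example, here is this computation with hypothetical screening-test numbers (the prevalence, sensitivity, and false-positive rate below are invented for illustration):

```python
# Hypothetical numbers, for illustration only.
p_disease = 0.01            # prevalence: Pr[disease]
p_pos_given_disease = 0.99  # sensitivity: Pr[positive | disease]
p_pos_given_healthy = 0.05  # false-positive rate: Pr[positive | no disease]

# Law of total probability: Pr[positive]
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: Pr[disease | positive]
posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)   # about 0.167: a positive test is far from conclusive
```

Even with a quite accurate test, the low prevalence means that most positives are false positives, which is why the posterior probability is so much smaller than the sensitivity.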

Some things to review:

Estimation and Confidence Intervals

After discussing hypothesis testing, we considered the problem of estimation, in which there is a quantity out there in nature that we are interested in “guessing”. Often, this quantity can be interpreted as the parameter of a model.

Example: Suppose we are trying to estimate how many customers contact a call center during a given span of time (e.g., between 2pm and 5pm on a Thursday). A common model for this sort of problem is to model the number of customers as a Poisson random variable. The mean of this distribution is the rate parameter \(\lambda\), and so estimating the average number of customers contacting our call center is equivalent to estimating the parameter \(\lambda\).

Rather than just “guessing” a parameter or other quantity, we typically want to attach a measure of uncertainty to our estimate. In this class, we did that using a confidence interval (CI): a random (i.e., data-dependent) interval constructed so that, across many repeated data sets, it contains the true value of the parameter a specified percentage of the time (e.g., 95%).
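As one illustration, here is a normal-approximation (CLT-based) interval for the call-center example above (the true rate and the sample size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=5)

lam_true = 40.0   # hypothetical true arrival rate (unknown in practice)
n = 50            # number of observed time spans
X = rng.poisson(lam=lam_true, size=n)

lam_hat = X.mean()         # point estimate of lambda
se = np.sqrt(lam_hat / n)  # plug-in SE: a Poisson's variance equals lambda

# approximate 95% CI from the normal approximation to the sample mean
ci = (lam_hat - 1.96 * se, lam_hat + 1.96 * se)
print(lam_hat, ci)
```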

We saw three basic ways to build confidence intervals this semester:

Some other concepts to review:

Linear Regression

The last third of our course was dedicated to the task of prediction, in which we aim to guess a label or “response” associated to one or more “predictor” variables.

Example: Our initial motivating example was (fictional) data in which we had collected, for each of a number of survey participants, two pieces of data: 1) how many years of education they had, and 2) their yearly salary. A natural goal is to try to predict a participant’s salary from their years of education.

The simplest model for settings such as the one above is simple linear regression, in which we model our response (a.k.a. dependent variable) \(Y\) as a linear function (okay, technically an affine function) of the predictor (a.k.a. independent variable) \(X\): \[ Y = \beta_0 + \beta_1 X + \epsilon, \] where \(\epsilon\) is a mean-zero normal random variable with variance \(\sigma^2\).

Given a collection of observations \((X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\), our goal is to

  1. Estimate \(\beta_0\) and \(\beta_1\)
  2. Use those estimates to predict the responses for other values of the predictors
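Here is a minimal sketch of both steps, using the closed-form least-squares estimates on simulated data (the coefficients and noise level below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=6)

# simulated data: beta0 = 20, beta1 = 3, sigma = 5 (all hypothetical)
n = 200
x = rng.uniform(10, 20, size=n)             # e.g., years of education
y = 20 + 3 * x + rng.normal(0, 5, size=n)   # noisy responses

# closed-form least-squares estimates
beta1_hat = (np.sum((x - x.mean()) * (y - y.mean()))
             / np.sum((x - x.mean()) ** 2))
beta0_hat = y.mean() - beta1_hat * x.mean()

# use the estimates to predict the response at a new predictor value
x_new = 16
y_pred = beta0_hat + beta1_hat * x_new
print(beta0_hat, beta1_hat, y_pred)
```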

We then discussed the case of multiple linear regression, in which we have more than one predictor variable for each of our observations.

Example: we might have a number of different pieces of information about a house (e.g., square footage, age of the house, school district, proximity to a park, etc), and our goal might be to predict the price of that house.

Some things to review:

Logistic Regression

One problem with linear regression is that it predicts that our response is… well, linear. That is, our predicted response is a linear function of the data. We saw that this is not so appropriate when our response is, for example, binary. In such situations, logistic regression is more appropriate.

Logistic regression models our observed response \(Y\) as a Bernoulli random variable, with success probability determined by a linear function of the predictors: \[ \Pr[ Y=1 ] = \sigma( \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p ), \]

where \(\sigma( \cdot )\) is the logistic function, \[ \sigma( z ) = \frac{ 1 }{ 1 + \exp\{ -z \} }. \]
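Here is a small sketch using scikit-learn’s LogisticRegression on data simulated from this model (the coefficients are hypothetical, and note that scikit-learn applies some regularization by default, so the estimates are only approximate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# simulate from the logistic model with beta0 = -1, beta1 = 2 (one predictor)
n = 500
x = rng.normal(size=(n, 1))
p = sigmoid(-1 + 2 * x[:, 0])
y = rng.binomial(1, p)   # Bernoulli responses

model = LogisticRegression().fit(x, y)
print(model.intercept_, model.coef_)   # roughly (-1, 2)

# estimated Pr[Y = 1] at a new predictor value
print(model.predict_proba([[0.5]])[:, 1])
```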

Some things to review:

Cross-Validation and model selection

One question that was left largely unaddressed in our first few weeks of discussion of regression and prediction was how we decided what variables to include in our model to begin with. This decision is the problem of variable selection, an example of the more general problem of model selection.

Example: recalling our housing example, there are many different pieces of information that we could collect about a house for use in predicting its price. If we try to use every available piece of information, then we are liable to over-fit our data (unless we have a truly huge number of observations). How do we decide which pieces of information to include in our model and which to ignore?

Answering the model selection problem requires that we be able to compare different models fitted to the same data. There are a number of different ways to do this, but we saw two basic tools: cross-validation and feature selection methods.

Cross-validation (CV)

Cross-validation is a general tool for comparing models. What we want is an estimate of how well our model would do if we used it to do prediction on previously unseen data. To do that, cross-validation trains our model on some of our data, and then evaluates its performance on the rest of our data, which was not used in our training of the model.

We discussed three different approaches to CV:

  • a “naive” approach that split the data in two
  • Leave-one-out cross validation (LOOCV), in which we iteratively set aside a single data point, trained a model on the \(n-1\) remaining points, and then evaluated the model’s performance in predicting the held-out data point
  • \(K\)-fold cross validation, in which we split our data into \(K\) “folds” and then iteratively set each fold aside, train a model on the remaining \(K-1\) folds of data, and evaluate our model’s performance on the held-out fold.
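All three can indeed be obtained from a single function: the sketch below estimates out-of-sample mean squared error by \(K\)-fold CV, with \(K = n\) giving LOOCV and \(K = 2\) giving a (shuffled) half-and-half split. The fit/predict interface and the line-fitting example are just one possible design:

```python
import numpy as np

def kfold_cv_mse(x, y, fit, predict, K, seed=8):
    """Estimate out-of-sample MSE by K-fold cross-validation.

    K = 2 gives a (shuffled) half-and-half split; K = len(y) gives LOOCV.
    """
    idx = np.random.default_rng(seed).permutation(len(y))  # shuffle indices
    errors = []
    for fold in np.array_split(idx, K):
        train = np.setdiff1d(idx, fold)   # everything outside this fold
        model = fit(x[train], y[train])   # train on the other K-1 folds
        preds = predict(model, x[fold])   # predict the held-out fold
        errors.append(np.mean((y[fold] - preds) ** 2))
    return np.mean(errors)

# usage with a simple least-squares line fit:
def fit_line(x, y):
    return np.polyfit(x, y, deg=1)   # returns (slope, intercept)

def predict_line(coefs, x):
    return np.polyval(coefs, x)

rng = np.random.default_rng(seed=9)
x = rng.uniform(0, 10, size=100)
y = 2 + 0.5 * x + rng.normal(0, 1, size=100)
print(kfold_cv_mse(x, y, fit_line, predict_line, K=5))
```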

Some things to review:

  • You should be able to implement any of the three CV approaches outlined above (indeed, note that they can all be obtained by writing one function, as in the sketch above!)
  • Explain why \(K\)-fold CV is preferable to the “naive” half-and-half split and to leave-one-out CV
  • Use cross-validation to choose among a collection of models

Feature selection and regularization

Another set of tools for choosing among a class of regression models is feature selection. Rather than using cross-validation, which requires setting aside data that could otherwise be used for model fitting, feature selection methods attempt to compare multiple models simultaneously.

You should be able to

  • Explain (but not necessarily implement) the basic ideas behind subset selection and step-wise feature selection (both forward and backward)
  • Explain the basic ideas behind ridge regression and the LASSO, including the role of the regularization parameter \(\lambda\)
  • Apply ridge regression and/or the LASSO to a given data set (see the sketch below)
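For instance, here is a sketch using scikit-learn, which calls the regularization parameter alpha rather than \(\lambda\). The data below are simulated, with only two of the ten predictors actually relevant:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(seed=10)

# simulated data: 10 predictors, only the first two matter
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # drives some coefficients exactly to zero

print(ridge.coef_)
print(lasso.coef_)   # most entries should be exactly zero
```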

The Bootstrap

Our last full week of lecture concerned the bootstrap, a method for estimating the variance of a statistic when simpler approaches (e.g., the sample variance or a plug-in approach) are not feasible.

The basic setup is that we have a parameter \(\theta\), and an estimator of that parameter \[ \hat{\theta} = \hat{\theta}(X_1,X_2,\dots,X_n). \]

We know that \(\hat{\theta}\) is approximately normal, with mean \(\theta\) and variance \(\nu^2/n\), so that if we knew \(\nu^2\), we could compute a confidence interval based on the approximation \[ \hat{\theta} \sim N\left( \theta, \frac{\nu^2}{n} \right). \]

Unfortunately, in many situations, \(\nu\) is not easy to estimate. The bootstrap estimates this variance by sampling with replacement from the data sample \(X_1,X_2,\dots,X_n\) to obtain a bootstrap sample \(X_1^*, X_2^*, \dots, X_n^*\), on which we can compute our statistic to obtain a bootstrap replicate \[ \hat{\theta}^* = \hat{\theta}( X_1^*, X_2^*, \dots, X_n^* ). \]

Repeating this procedure multiple times to obtain replicates \(\hat{\theta}^*_1, \hat{\theta}^*_2, \dots, \hat{\theta}^*_B\), we can estimate \(\nu^2/n\) according to \[ \frac{1}{B-1} \sum_{b=1}^B \left( \hat{\theta}^*_b - \frac{1}{B} \sum_i \hat{\theta}^*_i \right)^2. \]
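Here is a minimal sketch, assuming a hypothetical setting in which \(\theta\) is a population median, estimated by the sample median (a statistic whose variance has no simple closed form):

```python
import numpy as np

rng = np.random.default_rng(seed=11)

# hypothetical data; theta is the population median of an exponential
X = rng.exponential(scale=2.0, size=200)
theta_hat = np.median(X)

B = 2000
replicates = np.empty(B)
for b in range(B):
    # resample the data with replacement to get a bootstrap sample
    resample = rng.choice(X, size=len(X), replace=True)
    replicates[b] = np.median(resample)   # bootstrap replicate

var_hat = replicates.var(ddof=1)   # estimates nu^2/n, as in the formula above
se_hat = np.sqrt(var_hat)

# normal-approximation 95% CI for theta
print(theta_hat - 1.96 * se_hat, theta_hat + 1.96 * se_hat)
```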

Some things to review: