Understanding t-tests as Linear Models

Introduction

  • Objective: Understand how a t-test can be conceptualized as a special case of a linear model.

The Simple Linear Model:

  • Basic Idea: Relationship between variables using a linear equation.
  • Equation: \(Y = \beta_0 + \beta_1 X + \epsilon\)
  • “outcome” \(Y\) and “feature” \(X\) are numbers.
  • \(\epsilon\) is noise.
  • Just algebra… take your feature, multiply by \(\beta_1\), add \(\beta_0\), and add noise \(\epsilon\).
  • You measure pairs \((X, Y)\) for \(n\) different independent observations. With these values, we estimate \(\beta_0\) and \(\beta_1\).
  • “333 in one slide!”
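As a minimal sketch of what “estimating \(\beta_0\) and \(\beta_1\)” looks like in practice (simulated data, not from the course), `lm()` does the whole job:

```r
# Simulate from Y = beta0 + beta1 * X + epsilon, then estimate with lm().
set.seed(1)
x <- runif(100)                        # the "feature"
y <- 2 + 3 * x + rnorm(100, sd = 0.5)  # true beta0 = 2, beta1 = 3
fit <- lm(y ~ x)
coef(fit)  # estimates land close to (2, 3)
```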

Review of t-tests

  • Purpose: Compare means between two groups.
  • Types:
  • Independent samples t-test.
  • Paired sample t-test.

Example: Computing a t-test in R

  • We’ll use a simple built-in dataset in R.
  • Perform an independent samples t-test.
## Loading the data
library(tidyverse)
data(mtcars)
mtcars = as_tibble(mtcars)
mtcars
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

Background poll

Raise your hand if you’ve seen (or have used):

  1. %>% aka |>
  2. mtcars |> mutate()
  3. mtcars |> select()
  4. mtcars |> filter()
  5. mtcars |> group_by()
  6. mtcars |> summarize()
  7. pivot_wider / pivot_longer

Approximately 80% of data analysis is using these functions. Do you learn them in the 303 sequence? We will cover/review them in this class. They live in these packages:

library(dplyr); library(tidyr)
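A quick sketch of what a pipeline built from these verbs looks like (assuming dplyr is installed):

```r
library(dplyr)

# Mean mpg by number of cylinders: group_by() + summarize()
res <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n = n())
res
```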

Let’s perform an independent samples t-test

  • Comparing miles per gallon (mpg) between
  • automatic (am = 0) and
  • manual (am = 1) transmission cars
  • The data:
mtcars = mtcars |> 
  mutate(transmission = if_else(condition = am==1, 
                                true = "manual", 
                                false = "automatic")) 
mtcars |> 
  select(mpg, am, transmission)
# A tibble: 32 × 3
     mpg    am transmission
   <dbl> <dbl> <chr>       
 1  21       1 manual      
 2  21       1 manual      
 3  22.8     1 manual      
 4  21.4     0 automatic   
 5  18.7     0 automatic   
 6  18.1     0 automatic   
 7  14.3     0 automatic   
 8  24.4     0 automatic   
 9  22.8     0 automatic   
10  19.2     0 automatic   
# ℹ 22 more rows

I like to visualize before testing

Could this be cheating in some settings?

mtcars |> ggplot(aes(x = mpg)) + 
  geom_density() + 
  facet_wrap(~transmission)

Perform an independent samples t-test

t.test(mpg ~ transmission, data = mtcars)

    Welch Two Sample t-test

data:  mpg by transmission
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group automatic and group manual is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group automatic    mean in group manual 
               17.14737                24.39231 
  • How do you interpret this result?

Highly statistically significant!

  • Is this a causal effect?
  • yes and no!
  • Manual transmissions are (historically) more efficient…
  • but not 7 mpg…
  • What might be some confounding variables?

Expanded Independent Samples t-test

  • Scenario: Comparing means of two independent groups, say, Treatment (T) and Control (C).
  • Statistical Model:
  • Let \(Y_{Ti}\) and \(Y_{Ci}\) be the response variables (mpg) for the treatment (manual) and control (automatic) groups, respectively.
  • What is \(i\)?
  • We assume \(Y_{Ti} \sim N(\mu_T, \sigma^2)\) and \(Y_{Ci} \sim N(\mu_C, \sigma^2)\), where \(\mu_T\) and \(\mu_C\) are the group means, and \(\sigma^2\) is the common variance.
  • Null Hypothesis:
  • \(H_0: \mu_T = \mu_C\)
  • This means there is no effect of the treatment.

Linear Model Interpretation:

  • t-test notation:
  • Let \(Y_{Ti}\) and \(Y_{Ci}\) be the response variables (mpg) for the treatment (manual) and control (automatic) groups, respectively.
  • We assume \(Y_{Ti} \sim N(\mu_T, \sigma^2)\) and \(Y_{Ci} \sim N(\mu_C, \sigma^2)\), where \(\mu_T\) and \(\mu_C\) are the group means, and \(\sigma^2\) is the common variance.
  • Turn the t-test into a linear model:
  • We want to “shoe horn” the above model into a linear model, i.e. this notation: \(Y = \beta_0 + \beta_1 X + \epsilon\)
  • Hint: make \(X\) a binary indicator, or “dummy variable” (0/1).
  • Express \(\beta_0\) and \(\beta_1\) in terms of \(\mu_T\) and \(\mu_C\).
  • What is the Null Hypothesis in Linear Model coefficients \(\beta_0\) and \(\beta_1\)?
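To check your answer numerically, here is a sketch (base R assignment instead of mutate; note that t.test() defaults to Welch’s unequal-variance test, so var.equal = TRUE is needed to match the common-variance model above):

```r
# Equal-variance t-test and the equivalent linear model give the same test
mtcars$transmission <- ifelse(mtcars$am == 1, "manual", "automatic")

t_out  <- t.test(mpg ~ transmission, data = mtcars, var.equal = TRUE)
lm_fit <- lm(mpg ~ transmission, data = mtcars)
summary(lm_fit)
# (Intercept)        = automatic group mean          (beta0 = mu_C)
# transmissionmanual = manual minus automatic mean   (beta1 = mu_T - mu_C)
# Its t statistic equals t_out's up to sign, and H0: beta1 = 0.
```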

Paired Sample t-test

  • Scenario: Comparing the same group under two different conditions.
  • Statistical Model:
  • Let \(Y_{1i}\) and \(Y_{2i}\) be the response variables for condition 1 and condition 2, respectively, for the \(i\)-th subject.
  • Assume \(Y_{1i} \sim N(\mu_1, \sigma^2)\) and \(Y_{2i} \sim N(\mu_2, \sigma^2)\).
  • Shoe horned into a linear model:
  • Define \(\Delta Y_i = Y_{2i} - Y_{1i}\). The assumptions imply this is normally distributed.
  • \(H_0: \mu_1 = \mu_2\) or equivalently \(H_0: \mu_{\Delta} = 0\) where \(\mu_{\Delta}\) is the mean of \(\Delta Y_i\).
  • Model: \(\Delta Y = \beta_0 + \epsilon\)
  • Here, \(\beta_0 = \mu_{\Delta}\).
  • Null Hypothesis in Linear Model Terms: \(H_0: \beta_0 = 0\)
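A sketch with R’s built-in sleep data (10 subjects, each measured under two drugs; the pairing by subject ID is assumed here):

```r
# Paired t-test as an intercept-only linear model (built-in sleep data)
data(sleep)
d <- sleep$extra[sleep$group == 2] - sleep$extra[sleep$group == 1]

t_out  <- t.test(d)  # one-sample t-test on the differences
lm_fit <- lm(d ~ 1)  # Delta Y = beta0 + epsilon
summary(lm_fit)      # same t statistic; H0: beta0 = 0
```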

t-test as a Linear Model

  • Key Concept: Comparing means can be reframed as a linear model.

Independent Samples t-test

  • Scenario: Comparing two independent groups (e.g., Treatment vs. Control).
  • Linear Model Framework:
  • Group membership as a categorical variable.
  • Model: \(Y = \beta_0 + \beta_1 X + \epsilon\)
  • Interpretation:
  • \(\beta_1\) significantly different from zero indicates a group difference.

Paired Sample t-test

  • Scenario: Comparing the same group under two conditions.
  • Linear Model Framework:
  • Difference in scores as the outcome.
  • Model: \(\Delta Y = \beta_0 + \epsilon\)
  • Interpretation:
  • Testing if \(\beta_0\) is significantly different from zero.

Linear models look cold and empty when they are just notation

Hopefully, in the simple examples above, you

  • became aware of mutate to make new columns/variables and select to pick columns/variables,
  • refreshed on t-tests,
  • saw how the paired samples t-test is a linear model with no \(X\) and the independent samples t-test is a linear model with \(X\) as a “dummy variable”,
  • and worked through the hardest part: “doing algebra with models”.

Further Discussion 1

In the t-test above, there are two groups (“T/C”). What if there are more than two? mtcars gives the number of cylinders, cyl.

mtcars |> count(cyl)
# A tibble: 3 × 2
    cyl     n
  <dbl> <int>
1     4    11
2     6     7
3     8    14

How might we extend the t-test to more than two groups? What is the model/null hypothesis? What is the linear model?
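One answer, sketched in R: treat cyl as a categorical variable, which gives a linear model with two dummy variables (baseline = 4 cylinders); the F-test of “all group means equal” is one-way ANOVA:

```r
# More than two groups: one linear model, two dummy variables
fit <- lm(mpg ~ factor(cyl), data = mtcars)
summary(fit)  # coefficients are differences from the 4-cylinder mean
anova(fit)    # F-test of H0: mu_4 = mu_6 = mu_8
```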

Further Discussion 2

If we are interested in the causal effect of transmission, then we need to realize that there are confounders in the analysis above. For example, automatic cars tend to be heavier (that’s bad for mpg):

mtcars |> mutate(weight = wt) |> 
  ggplot(aes(x = weight)) + 
  geom_density() + 
  facet_wrap(~transmission)

This gets to the core of linear models and why they are so important.
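A sketch of what “adjusting for a confounder” looks like (base R assignment instead of mutate): put wt in the model and compare the transmission coefficient with and without it.

```r
mtcars$transmission <- ifelse(mtcars$am == 1, "manual", "automatic")

fit_raw <- lm(mpg ~ transmission, data = mtcars)       # unadjusted
fit_adj <- lm(mpg ~ transmission + wt, data = mtcars)  # adjusted for weight

coef(fit_raw)  # roughly a 7 mpg gap between transmissions
coef(fit_adj)  # the gap shrinks a lot once weight is held fixed
```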

Moar plots Karl!

mtcars |> mutate(weight = wt) |> 
  ggplot(aes(x = weight, y = mpg)) + 
  geom_point(aes(color = transmission)) + 
  geom_smooth(method = "lm")

Moar plots Karl!

mtcars |> mutate(weight = wt) |> 
  ggplot(aes(x = weight, y = mpg)) + 
  geom_point(aes(color = transmission)) + 
  geom_smooth(aes(color = transmission), method = "lm")