These notes are meant to give a very brief introduction to some ideas in an emerging area of statistics called causal inference. This material will not be on any exams or homework. The lecture is meant to give you a glimpse of one currently very active area of statistical research.
After this lecture, you will be able to
(Okay, it’s actually named after https://en.wikipedia.org/wiki/Edward_H._Simpson, but I couldn’t resist)
Let’s start today by looking at an especially famous data set from a real medical study. The data concerns treatment of kidney stones, and was meant to compare two drugs, A and B, and their effects on small and large kidney stones. The data is reproduced below (see here for the original study).
We have two different treatments and two different stone sizes. Under each combination of treatment and stone size, we record how many patients had that stone size and that treatment (trials) and how many of those patients responded positively to treatment (successes).
treatments <- c( 'A', 'B', 'A', 'B' );
stonesize <- c('Small', 'Small', 'Large', 'Large' );
trials <- c( 87, 270, 263, 80 );
successes <- c( 81, 234, 192, 55 );
expt_outcome <- data.frame(drug=as.factor(treatments), size=as.factor(stonesize),
trials=trials, successes=successes )
head(expt_outcome)
## drug size trials successes
## 1 A Small 87 81
## 2 B Small 270 234
## 3 A Large 263 192
## 4 B Large 80 55
So here is a simple question: which treatment is better: treatment A or treatment B?
Well, let’s compare their success rates. Treatment A has \(87 + 263 = 350\) trials, of which \(81+192=273\) were successful. That’s a success rate of \(273/350 = 0.78\).
A similar calculation for treatment B finds a success rate of \((234+55)/(270+80) = 289/350 \approx 0.83\).
So it would seem that treatment B is better than treatment A.
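As a quick check, we can compute these aggregate success rates directly from the expt_outcome data frame defined above (totals and success_rate are just names introduced for this computation):

# Total up trials and successes for each drug, ignoring stone size.
totals <- aggregate( cbind(trials, successes) ~ drug, data=expt_outcome, FUN=sum );
totals$success_rate <- totals$successes / totals$trials;
totals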
But that’s not the whole story. Let’s consider the effectiveness of the two drugs on the small and large stones separately.
On small stones, treatment A has a success rate of \(81/87 = 0.93\), while treatment B has a success rate of \(234/270=0.87\). On large stones, treatment A has a success rate of \(192/263= 0.73\), while treatment B has a success rate of \(55/80=0.69\).
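Similarly, we can compute the within-group success rates directly from the data frame:

# Success rate within each combination of drug and stone size.
expt_outcome$success_rate <- expt_outcome$successes / expt_outcome$trials;
expt_outcome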
In other words, it would seem that while B is the more effective treatment when we look at all kidney stones in aggregate, A is the more effective treatment when considered separately on each of the two different stone sizes.
The source of this particular disparity arises from the fact that while treatment B was less effective, doctors were prescribing it more frequently for small kidney stones, which are easier to treat. On the other hand, the stronger medication A was prescribed more often for large kidney stones, which are harder to treat, but medication A was indeed more successful than treatment B in treating those stones.
Treatments were assigned to patients based on their covariates (in this case, based on kidney stone size), yielding strange results! We'll come back to this point shortly.
Here’s a visual example:
Notice that if we analyze y as a function of x while ignoring the two different red and blue categories, we would conclude that y decreases as x increases. On the other hand, if we examine the effect of x separately within each of the red and blue categories, it would seem that increasing x increases y.
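The figure itself is not reproduced in these notes, but data like it are easy to simulate. Here is a minimal sketch (the group locations, slopes and noise levels are made up purely for illustration) in which y increases with x within each color group, yet decreases with x overall:

set.seed(1); # for reproducibility
n <- 100;
# Red group: smaller x values, higher y values overall; y increases with x within the group.
x_red <- runif(n, 0, 5);
y_red <- 8 + 0.8*x_red + rnorm(n);
# Blue group: larger x values, lower y values overall; y still increases with x within the group.
x_blue <- runif(n, 5, 10);
y_blue <- 0.8*x_blue + rnorm(n);
# Plotted together, the overall trend is decreasing, even though each group's trend is increasing.
plot( c(x_red, x_blue), c(y_red, y_blue),
      col=rep(c('red', 'blue'), each=n), pch=19, xlab='x', ylab='y' );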
This phenomenon is referred to as Simpson’s paradox. In general, though, there is no paradox per se when we see this phenomenon. The answer we get depends on how we choose to analyze the data– that’s not a paradox at all!
Simpson’s paradox is not the result of doing anything wrong, in general. It happens by chance in data quite often. Still, oftentimes this phenomenon indicates that we may be mistakenly ignoring something important in our data, like preferentially treating different subjects in different ways or ignoring the red vs blue categorical variable.
For example, see the Wikipedia page on Simpson’s paradox for an example from baseball comparing two players’ batting averages for two different years.
Simpson’s paradox arises when we choose to ignore (or not ignore) a particular variable’s effect on some outcome of interest. Let’s keep looking at that kind of problem from another angle.
Suppose we have a theory that coffee consumption causes increased predilection for cat ownership. And I do mean “causes”– as in, philosophical questions of “what does causality mean” aside, increasing coffee consumption will directly result in more cat ownership.
(Note that this running example is just a stand-in for any study trying to associate some behavior or other with disease– substitute your favorites as you see fit!) How might we test this theory?
The easiest approach (for some definition of the word "easy", anyway) is to perform an observational study, whereby we collect relevant data (i.e., people's cat ownership status and coffee consumption habits), and then look at the extent to which increased coffee consumption is associated with increased cat ownership.
The Pima Indian data set, which we discussed at length last week in lecture and discussion section, is an example of an observational study. The study authors recruited participants from the community of interest and recorded information about each subject in the study (e.g., glucose levels, age, blood pressure, diabetes status, etc.). The data is observational because it is based on observations of a population– it’s as though we just kind of happen upon our data, already existing out there in the world.
The trouble with observational data like this is that in general, we cannot use it to answer questions about causality. Suppose that we observe that people who drink more coffee tend to be more likely to own cats. Does increased coffee consumption cause increased risk of cat ownership? Well, possibly, but the direction of causality might go the other way: owning more cats might make a person more enthusiastic about coffee. Alternatively, there may even be a common cause: perhaps wealthier people are both more inclined to drink coffee (because they have time and money to do so) and are more likely to own cats (because they have the time and money to do so).
We cannot tell these different situations apart merely based on observational data: our data describes the world as it exists, but it doesn't tell us anything about what would happen if we reassigned someone to drink two cups of coffee a day instead of only one. If such a person were indeed more likely to own cats, this would be evidence that coffee consumption does indeed "cause" cat ownership!
But we cannot answer this kind of counterfactual question, "what would have happened?", using observational data.
What we would like to do is to observe what “would have happened” had we assigned someone to drink two cups of coffee a day, and compare that person’s outcome to the universe in which that same person only drinks one cup a day (or none, or three or…). Of course, this is impossible.
The next best thing to this is a randomized controlled trial (RCT). We have a collection of test subjects, drawn randomly from our population of interest (e.g., undergraduate students at UW-Madison enrolled in data science and/or statistics courses). We take our sample of test subjects and assign them randomly to a treatment (or control).
For example, to study the impact of coffee consumption on cat ownership, we might recruit study participants, record their cat ownership status, and then assign each participant at random to drink either zero, one or two cups of coffee per day.
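As a sketch, a random assignment like this might look as follows in R (the participant labels and the group of 30 subjects are hypothetical, invented just for illustration):

set.seed(1); # for reproducibility
# 30 hypothetical study participants.
participants <- paste0('participant_', 1:30);
# Assign each participant, at random, to drink 0, 1 or 2 cups of coffee per day.
cups_assigned <- sample( c(0, 1, 2), size=length(participants), replace=TRUE );
table( cups_assigned )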
Under this setup, we don't get to observe the "counterfactual" universe in which we see what would have happened to the same study participant assigned to each of the different coffee consumption conditions. Nonetheless, because each subject is assigned randomly to a treatment regime, we get to observe this effect "on average". Since assignment of each study participant to a treatment regime is random, any difference that we observe among the different treatment groups must be due to that treatment. That is, if we observe a (statistically significant) difference in cat ownership between participants drinking different amounts of coffee, that difference must be due to the amount of coffee they drank– we can make a causal attribution!
Example: drug trials. RCTs are especially common in testing the efficacy of drugs. For example, in the efficacy trial for the Moderna COVID-19 vaccine, the 30,000 study participants were assigned randomly to receive either a placebo (i.e., control) or vaccine. Because this assignment was made randomly and independently of any information about the study participants, any observed difference between the placebo and treatment groups must be due to an effect of the vaccine– after all, the only difference between the groups (on average) is whether they received placebo or vaccine.
There are a few important points about the above ideas that we need to be careful of. Let’s illustrate them with examples.
Example: treating depression. Suppose that we have a theory that candy cures sadness. To test our theory, we go to the local coffee shop and give a lollipop to every person who looks sad. Then we give a questionnaire to everyone in the coffee shop to screen for whether or not they are sad. When we compare our treated group (people given candy) to our untreated group, we find that people who received candy are more likely to report being sad!
Of course it’s clear why the above fictional experiment is flawed, but it illustrates an important concern in experiments: we need to make sure that our assignment of participants to treatment or control does not depend on their covariates (i.e., on other information that we know about them).
If we fail to ensure this, we’re back where we started: the thing we want to study is correlated with other variables in the data!
Example: eating healthy. Suppose that we are running an experiment to assess the impact of a Mediterranean diet on people's cholesterol levels. Study participants are assigned to either eat a Mediterranean diet or to continue to eat normally. Suppose, however, that some of our study participants know each other, and tend to eat lunch together. If our study participants assigned to the treatment group start eating healthier, they might induce the control participants to tag along with them to the Mediterranean restaurant for lunch a few times a week. Our subjects' treatments are interfering with one another.
We need to make sure that the outcomes that we measure for our subjects do not interfere with one another. If they do, we will run into trouble when trying to estimate the effect of our treatment.
These are exactly the kinds of issues we are trying to avoid in performing randomized controlled trials– randomness and independence from other covariates ensure that the only thing different between our treated and control groups is… whether they received treatment or control! If we achieve that, any difference we observe between the groups must be due to the difference in treatment.
As you may or may not be aware, in some states, it is illegal to pump your own gas. There are a number of arguments (all of them rather silly, in my view, but my opinion doesn't count for much) for why to have such a policy, but one argument against it is that it causes gas prices to be higher than they otherwise would be, because gas stations must pay an attendant to pump gas. The U.S. states of New Jersey and Oregon have had such laws on the books for a long time.
How might we test whether or not these “no pumping your own gas” laws cause higher gas prices?
After all, we can’t randomly assign people to different states or different legal systems… What do we do?
Well, occasionally, nature (or the legal system) performs what we term a natural experiment, in which populations are exposed (or not exposed) to a treatment of interest due to random factors (or, at any rate, factors that are outside of their control).
For example, suppose that another state, say, Michigan, decided to pass a “no self-service” law, becoming like Oregon and New Jersey. This would be an example of a natural experiment: people in Michigan (and their consumer habits) are largely comparable to people in neighboring states (e.g., Ohio, Indiana, Illinois, Wisconsin). So, to measure the effect of “no self-service” laws on gas prices, we could compare how gas prices in Michigan behave from the time just before the law until just after, and compare that behavior to that observed in the other, similar states.
Roughly speaking, we could treat Michigan as though it were “randomly” assigned to be a “no self-service state”, and use other similar states as controls!
If there is no effect on gas prices, then gas prices in Michigan should continue to behave similarly to those in other similar states. On the other hand, if we observe a difference in how Michigan gas prices behave after the new law when compared with other states that didn’t enact a new gas pumping law, that would be evidence that the new law had an effect on gas prices.
This is the core idea behind the difference in differences framework: we have something changing over time (i.e., gas prices), and we want to assess whether the change over time was different in different environments (i.e., in the presence or absence of the new law).
See, at the time one or more states enacts a “no self-service” law, many other things are changing as well: oil prices, global financial markets, conflicts overseas, seasonal demand for gas (e.g., more people drive in the summer, increasing the cost of gas). How do we separate those changes from the changes caused by the change in the law?
Well, if these broad other changes (i.e., seasonal effects, etc) affect all of the states equally, then we should be able to, in a sense, “subtract off” that effect from all of our states.
To see how this works in practice, let’s suppose that we have data from two different groups: say that we have \(2n\) subjects, \(n\) of whom are in the “control” condition (say, gas stations in states that have not implemented “no self-service” laws), and \(n\) of whom are in the “treatment” group (say, gas stations in a state that has recently implemented such a law).
Let \(i=1,2,\dots,n\) be the subjects from the control group, and \(i=n+1,n+2,\dots,2n\) be the subjects in the treatment group.
For each such subject, suppose that we measure our quantity of interest (i.e., gas prices) at some time \(0\) before the change (i.e., before the enactment of the law in our “treatment” state), and at some other time \(1\) after the change.
Then, for each \(i=1,2,\dots,n,\dots,2n\), we have measurements \(y_{i,t}\) for \(t=0\) and \(t=1\).
Since all these measurements were made at the same two times, we can very easily compare the quantity \(y_{i,1} - y_{i,0}\) for our treatment subjects and our control subjects using ideas from statistical testing that we have discussed at length this semester.
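For instance, with simulated data (the sample sizes, common trend, and treatment effect below are invented purely for illustration), this comparison is just a two-sample t-test on the differences \(y_{i,1} - y_{i,0}\):

set.seed(1); # for reproducibility
n <- 50;
# Control group: before (t=0) and after (t=1) measurements with a shared trend, no treatment effect.
y_control_0 <- rnorm(n, mean=3);
y_control_1 <- y_control_0 + 0.5 + rnorm(n, sd=0.2);
# Treatment group: the same shared trend, plus a treatment effect of 0.3.
y_treat_0 <- rnorm(n, mean=3);
y_treat_1 <- y_treat_0 + 0.5 + 0.3 + rnorm(n, sd=0.2);
# Compare the before-after changes in the two groups.
t.test( y_treat_1 - y_treat_0, y_control_1 - y_control_0 )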
But what if we don’t observe all the measurements on the same schedule? That is, what if we observe \(y_{i,t}\) for lots of different values of \(t\), and that those observation times are not guaranteed to be the same from one subject to the next, let alone from the control group to the treatment group?
Well, that's where difference in differences comes in. While we don't have time to go into detail in lecture, the basic idea is as follows: we regress our response of interest against a variable encoding time (i.e., \(t\)) and an "after the treatment" variable encoding whether or not a measurement was made under the treatment of interest (i.e., after the enactment of the law, in the state that actually enacted the law).
If there were no effect of the law, the trajectory of how gas prices change over time would not depend on the "after the treatment" variable. On the other hand, if there were an effect of the law, the two trajectories would be the same until the enactment of the law, at which point the trajectories would differ between the two states. That would manifest as an interaction between the "time" and "after the treatment" variables! This gives us a way to test whether or not the law has an effect!
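Here is a minimal sketch of that regression on simulated data (the enactment time, price trend, and effect size below are invented purely for illustration):

set.seed(1); # for reproducibility
n <- 200;
time <- runif(n, 0, 10); # observation times, not on a common schedule
state <- rep( c('control', 'treatment'), each=n/2 );
# "After the treatment" indicator: measurement made in the treatment state after the law (enacted at time 5).
after <- as.numeric( state=='treatment' & time > 5 );
# Simulated gas prices: a shared upward trend, with a steeper trend once the law takes effect.
price <- 2 + 0.05*time + after*0.1*(time - 5) + rnorm(n, sd=0.1);
# If the law has an effect, it shows up in the time:after interaction term.
summary( lm(price ~ time*after) )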
This framework was famously used in a study to assess whether or not raising the minimum wage caused a change in employment. See this section of the DID Wikipedia page for an overview.
In short, in 1992, New Jersey passed a law to raise the minimum wage, while neighboring Pennsylvania did not. David Card and Alan Krueger used this natural experiment to assess whether or not employment levels changed differently in the two states, reasoning that the habits and consumer behaviors of Pennsylvania and New Jersey are otherwise mostly comparable.
Below is the data (as reproduced by Wikipedia). The law was enacted in April of 1992:
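Here, as a minimal sketch, are the average full-time-equivalent (FTE) employment figures before (February 1992) and after (November 1992) the wage increase, rounded to two decimals as given in that Wikipedia table, together with the difference in differences computation:

# Average FTE employment per restaurant, as reported in the Wikipedia summary of Card and Krueger's data.
nj_before <- 20.44; nj_after <- 21.03; # New Jersey: raised its minimum wage
pa_before <- 23.33; pa_after <- 21.17; # Pennsylvania: did not
# Difference in differences: New Jersey's change, minus Pennsylvania's change.
(nj_after - nj_before) - (pa_after - pa_before)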
In other words, raising the minimum wage in New Jersey was associated with a small increase in employment, as measured by full-time equivalents, a way of comparing hours worked across different sectors in which not all workers are employed for the same number of hours per week.
David Card was among the recipients of the 2021 Nobel Prize in Economics for his work on some of the accompanying statistical tools we’ve discussed briefly today.