Chapter 1 Using tests to inform policy
Most of this SOP dives into our statistical decision making. It assumes that the reader has heard of hypothesis tests and statistical estimators. Here, we first explain in broad terms how statistical tests and estimators are useful for policy learning.
“Evidence-based policy” can refer to both “evidence-as-insight” (designing new policies based on prior empirical evidence) and “evidence-as-evaluation” (carefully designing studies to learn about how and whether a new policy works). For more on this distinction, see Bowers and Testa (2019). Our team aims to help partners with both the designing and learning elements of evidence-based policy. However, the OES SOP focuses on the learning part of our work.
How would we know whether and how a new policy worked? In an ideal and unrealistic case, we would know that a new policy altered the behavior of a single person, John, by comparing his decisions both with and without the new policy at the same moment in time (i.e., if we could observe two different realities at once). If we saw that John’s decisions were better under the new policy than under the status quo, we would say that the new policy caused John to make better decisions.
Since no one can observe John in two different situations at once (say, making health decisions with and without access to a new procedure for visiting the doctor), researchers try to find other people who represent how John would have acted without being exposed to the new policy. Holland (1986) calls this inability to observe the same person under both conditions the “fundamental problem of causal inference” and explains more formally when we might believe that other people are good approximations for how John would have acted without the new policy. For example, if access to the new policy is randomized, it is easier to argue that the two groups (new policy vs. status quo) are good “counterfactuals” for each other. Our team tends to think about the causal effects of a policy in counterfactual terms.
With this in mind, we often use randomized experiments to create groups of people who represent behavior under both the new policy and the status quo. In medical experiments, these two groups tend to be called the “treatment group” and the “control group.” Social scientists often use the same language even when we are not really providing a new treatment but are, instead, offering a new communication or decision-making structure. If we pilot a new policy by offering it to people chosen at random, we can use what we see in each group to learn about the other: the treatment group shows what would likely have happened if the control group had received the new policy, and the control group shows what would likely have happened if the treatment group had not received it.
In any given sample, we won’t know exactly how the treatment group would have behaved if they had been assigned to the control group instead. For example, imagine we pulled 500 out of 1000 names from a hat and assigned those 500 people to receive treatment. Any “treated” sample we observe is just one of many possible sets of 500 people we could have drawn. If we were to run the experiment again and pull a different 500 names at random, this second experiment would also have a randomly assigned treatment group, but the second 500 people would be at least a little different from the first 500. It is important to ask how this variation influences our findings (“statistical uncertainty”). How much could our result differ by chance, just due to pulling a different 500 people from the hat, rather than due to a real treatment effect?
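The hat example can be made concrete with a short simulation. The sketch below is only an illustration, not a procedure taken from this SOP: the outcomes are made-up random numbers, the true treatment effect is assumed to be exactly zero, and the function name `simulate_rerandomization` and its parameters are hypothetical. It repeats the draw of 500 names many times and records the treated-minus-control difference in means each time.

```python
import random
import statistics


def simulate_rerandomization(n_people=1000, n_treated=500, n_draws=2000, seed=1):
    """Repeat the 'names from a hat' draw and record the difference in means.

    Assumption for illustration: each person's outcome is a fixed, made-up
    number, and the treatment has no effect on anyone. Any nonzero difference
    between the groups is therefore due only to who happened to be drawn.
    """
    rng = random.Random(seed)
    outcomes = [rng.gauss(0, 1) for _ in range(n_people)]  # hypothetical data
    ids = list(range(n_people))
    diffs = []
    for _ in range(n_draws):
        # Pull n_treated names from the hat without replacement.
        treated_ids = set(rng.sample(ids, n_treated))
        treated = [outcomes[i] for i in ids if i in treated_ids]
        control = [outcomes[i] for i in ids if i not in treated_ids]
        diffs.append(statistics.fmean(treated) - statistics.fmean(control))
    return diffs


diffs = simulate_rerandomization()
print("average difference across draws:", round(statistics.fmean(diffs), 3))
print("spread (SD) of the differences:", round(statistics.stdev(diffs), 3))
```

Under these assumptions the differences center near zero, and their spread is roughly 0.06 outcome standard deviations. That spread is one way to picture the statistical uncertainty that comes from the random draw alone.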
Our team draws on statistical theory to estimate the causal effect of a new policy and its associated statistical uncertainty. We also use statistical theory to answer questions like “Is this effect meaningfully distinguishable from zero?” or “How many people do we need to observe in order to distinguish a positive effect from a zero effect?” The rest of this document presents decisions we have made about the particulars of estimators and tests, as well as other tricky decisions that we have had to confront, such as how we can learn from pre-intervention data to design experiments that provide more precise results.
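To give a flavor of how such questions can be answered, here is a minimal sketch under stated assumptions. It is not the specific set of estimators or tests chosen in the rest of this document. The data are simulated, the function names `randomization_p_value` and `n_per_arm` are hypothetical, the first function is a generic two-sided randomization (permutation) test on the difference in means, and the second uses the textbook normal-approximation formula for comparing two equal-sized groups with a common outcome standard deviation.

```python
import math
import random
import statistics
from statistics import NormalDist


def randomization_p_value(treated, control, n_perms=2000, seed=2):
    """Two-sided randomization test for a difference in means.

    Reshuffles the treatment labels many times and counts how often a
    difference at least as large as the observed one arises by chance.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    extreme = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_t]) - statistics.fmean(pooled[n_t:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (extreme + 1) / (n_perms + 1)


def n_per_arm(effect_in_sd, alpha=0.05, power=0.80):
    """Approximate people needed per group to detect a standardized effect.

    Normal approximation, two-sided test, equal group sizes (assumptions).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_in_sd ** 2)


# Hypothetical data: 200 people per group, true effect of 0.5 SD.
data_rng = random.Random(3)
control_outcomes = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
treated_outcomes = [data_rng.gauss(0.5, 1.0) for _ in range(200)]

p_value = randomization_p_value(treated_outcomes, control_outcomes)
print("randomization p-value:", p_value)
print("people per group for a 0.2 SD effect:", n_per_arm(0.2))
```

The first function speaks to whether an observed difference is distinguishable from zero, and the second to how many people are needed. With the defaults shown (5% significance level, 80% power), `n_per_arm(0.2)` returns 393 people per group.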
For more on the basics of how statistics helps us answer questions about causal effects, we recommend chapters 1–3 of Gerber and Green (2012) (which focuses on randomized experiments) and the first part of Rosenbaum (2017) (which focuses on both experiments and research designs without randomization). Another good treatment comes from the opening chapters of Angrist and Pischke (2009).