Chapter 2 Key design criteria

Here, we briefly define and describe the properties we prioritize when choosing statistical procedures or designing evaluations. To summarize, we want to observe a reasonably precise estimate of a policy intervention’s impacts (with a small margin of error). We also want to use statistical tests that will rarely mislead us—i.e., will rarely give a false positive or false negative result—and we want to estimate policy impacts without bias (without systematically over-estimating or under-estimating the truth). These properties depend on both the initial design of the study and the choices we make when calculating results.

2.1 High statistical power

The research designs we use at OES aim to enhance our ability to distinguish signal from random noise: estimates from studies with very few observations are subject to more random noise and therefore cannot tell us much about the treatment effect. In contrast, studies with very many observations are subject to less noise and provide a lot of information about the treatment effect. A study that effectively distinguishes signal from noise has high statistical power, while a study that cannot do this has low statistical power. The Evidence in Governance and Politics (EGAP) Methods Guide 10 Things You Need to Know about Statistical Power explains more about what statistical power is and how to assess it.

Before we field a research design, we assess its statistical power. If we anticipate that the intervention will only make a small change in people’s behavior, then we will need a relatively large number of people in the study: too few people would result in a report saying something like, “The new policy might have improved the lives of participants, but we can’t argue strongly that this is so because our margin of error is too wide.”
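
As a rough illustration, the sketch below solves for the sample size a simple two-arm design would need. It assumes equal-sized groups, a continuous outcome, and a standardized effect size of 0.05; these are illustrative values, not numbers from any particular OES project.

```python
# A sketch of a power calculation for a simple two-arm design with
# equal-sized groups and a continuous outcome (illustrative assumptions).
from statsmodels.stats.power import tt_ind_solve_power

# Solve for the number of observations needed per arm to detect a
# standardized effect of 0.05 (i.e., 5% of a standard deviation) with
# 80% power, using a two-sided t-test at the 5% level.
n_per_arm = tt_ind_solve_power(effect_size=0.05, alpha=0.05, power=0.80)
print(round(n_per_arm))  # about 6,280 observations per arm
```

Under assumptions like these, a study with only a few hundred people per arm would be badly underpowered, which is the kind of situation the hypothetical report language above describes.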

2.2 Controlled error rates

A good statistical test rarely rejects a true hypothesis and often rejects false hypotheses. For us, the relevant hypothesis for error rate control is generally the null hypothesis of no treatment effect.1 It is a norm in many applied fields to design tests so that, over the long run (i.e., across many studies), the rate of rejecting a true null hypothesis will be no more than 5%, while the rate of rejecting a false null hypothesis will be at least 80%.2 The EGAP Methods Guide 10 Things to Know about Hypothesis Testing describes more about the basics of hypothesis tests. Our team tries to choose testing procedures that are not likely to mislead analysts (i.e., that adequately control both error rates), both when we make our analysis plans and as we complete our analyses and re-analyses.
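
One way to see what controlling the false positive rate means is to simulate many studies in which the null of no treatment effect is true and count how often a conventional test rejects it. The sketch below does this with simulated data; the sample size, outcome distribution, and choice of a two-sample t-test are illustrative assumptions, not a prescribed OES procedure.

```python
# A simulation check of the false positive rate: when the null of no
# treatment effect is true, a 5%-level test should reject in roughly 5%
# of repeated studies. Sample size and outcome distribution are
# illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_per_arm, alpha, sims = 500, 0.05, 5000

false_positives = 0
for _ in range(sims):
    # The treatment truly does nothing: both arms share one distribution.
    control = rng.normal(0.0, 1.0, size=n_per_arm)
    treated = rng.normal(0.0, 1.0, size=n_per_arm)
    _, p_value = stats.ttest_ind(treated, control)
    false_positives += p_value < alpha

print(false_positives / sims)  # should be close to 0.05
```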

2.3 Unbiased estimators

While the other criteria focus on the problem of random noise, we also need to think about whether an estimator produces results that are consistently wrong in a particular direction. A good estimation procedure (i.e., a method of estimating a treatment effect) should not systematically over-estimate or under-estimate the truth—i.e., it should be unbiased. The difference-in-means between a randomly assigned treatment and control group is well known to be an unbiased estimator of the sample average treatment effect, so this is often a primary quantity of interest that our team reports. In practice, we usually estimate this using a statistical procedure called linear regression.
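
The sketch below shows, with simulated data, why linear regression can serve this role: with a randomly assigned binary treatment, the regression coefficient on the treatment indicator equals the simple difference-in-means. All data-generating values here are illustrative assumptions.

```python
# Simulated example: with a randomly assigned binary treatment, the OLS
# coefficient on the treatment indicator equals the difference in mean
# outcomes between the treated and control groups. All data-generating
# values are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
treat = rng.permutation(np.repeat([0, 1], n // 2))   # random assignment
outcome = 10 + 0.5 * treat + rng.normal(0, 2, size=n)

diff_in_means = outcome[treat == 1].mean() - outcome[treat == 0].mean()
ols_coef = sm.OLS(outcome, sm.add_constant(treat)).fit().params[1]

print(diff_in_means, ols_coef)  # identical up to floating-point error
```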

Even when dealing with binary outcomes, we still tend to prefer linear regression over other statistical models designed for binary outcomes, such as logistic (Logit) regression. This is primarily due to linear regression’s ease of interpretation and well-established properties as an average treatment effect estimator (Angrist and Pischke 2009; Aronow and Miller 2019). Secondarily, some researchers have raised concerns about the implicit assumptions involved in using Logit regression to estimate average treatment effects (Gomila 2021; Freedman 2008b), though it is worth noting that the salience of these concerns depends on how the Logit model is being used. We do consider alternative models or estimators in situations where we think they will substantially outperform linear regression.
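
The sketch below illustrates this choice with a simulated binary outcome: the coefficient from a linear regression (a “linear probability model”) is directly a difference in proportions, while the Logit coefficient lives on the log-odds scale and needs an additional step, such as computing average marginal effects, before it can be read the same way. The take-up rates and sample size are illustrative assumptions.

```python
# Simulated binary outcome: a linear regression ("linear probability
# model") versus a Logit model. Take-up rates of 0.20 (control) and
# 0.23 (treatment) and the sample size are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 10_000
treat = rng.permutation(np.repeat([0, 1], n // 2))
outcome = rng.binomial(1, np.where(treat == 1, 0.23, 0.20))

X = sm.add_constant(treat)
lpm = sm.OLS(outcome, X).fit()
logit = sm.Logit(outcome, X).fit(disp=0)

# The OLS coefficient is directly an estimated difference in proportions;
# the Logit coefficient is on the log-odds scale and needs a further step
# (here, average marginal effects) to be read the same way.
print(lpm.params[1])                   # estimated difference in proportions
print(logit.get_margeff().margeff[0])  # average marginal effect; agrees closely
```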

References

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Aronow, Peter M., and Benjamin T. Miller. 2019. Foundations of Agnostic Statistics. Cambridge University Press.
Freedman, David A. 2008b. “Randomization Does Not Justify Logistic Regression.” Statistical Science 23 (2): 237–49.
Gomila, Robin. 2021. “Logistic or Linear? Estimating Causal Effects of Experimental Treatments on Binary Outcomes Using Regression Analysis.” Journal of Experimental Psychology: General 150 (4): 700.

  1. In standard statistical procedures, this will be the null of no treatment effect on average. But we might also sometimes test a null of no treatment effect for any unit. See Chapter 3.↩︎

  2. This latter rate refers to statistical power, discussed above.↩︎