Chapter 7 Communicating evidence
Some claim to “let the data speak.” But evaluators can’t just report results from a statistical test and call it a day. The data might speak, but we still have to translate. Deciding how to best distill evidence and communicate it to an audience with other skill sets and responsibilities is not always straightforward. What conclusions can we support based on our tests, and which can we not support? Is a non-technical summary sacrificing accuracy in a meaningful way? We think about these kinds of questions often. They also motivate high-profile discussions about how applied researchers should be evaluating statistical evidence in the first place (McShane et al. 2019; Amrhein, Greenland, and McShane 2019).
This chapter reviews a few related topics that frequently come up in our work:
How we tend to talk about common quantities we estimate
How we make sense of “null” results
Testing for effectiveness vs testing for a lack of negative effects
Incorporating cost estimates into our interpretations of evidence
As with other chapters, this content is subject to change, and the goal is not to require OES team members to present evidence in a particular way. Instead, the goal is just to help OES team members coordinate around a common language, and to provide examples of how we have thought about these issues in the past.
7.1 Talking about common estimates
Earlier chapters discuss a few quantities we commonly calculate and report as part of an evaluation:
Average treatment effect estimates
p-values
Confidence intervals
It is important for credibility to remember (1) the correct technical definition of each. But it is also important to not lose sight of (2) what we hope to learn from them, or why we’re calculating these quantities in the first place. When there’s enough of a gap between these things, it might mean that we need to re-think something about our evaluation.
In the subsections below, we review (1) and (2) for all three quantities. The primary aim is to serve as a reference so that team members can more easily coordinate around how we want to discuss these estimates in different situations. We also want to help OES team members think about when they might want to deviate and report other summaries of our data instead.
7.1.1 Average treatment effects
7.1.1.1 What is it?
Refer back to the first chapter for a simpler introduction, or Holland (1986) for a more academic one. Imagine the same person in two different counterfactual scenarios: a world where they experience some policy change (the treated “potential outcome”), and an alternative world in which they don’t (the control “potential outcome”).
Here are potential outcomes for the first six “people” in the example dataset we’ll use in this chapter. The variables y0 and y1 are control and treated potential outcomes, respectively, while Z is a person’s treatment status, Y is the actual outcome we observe given treatment, and tau is the true treatment effect for any one person.
## Create an ATE variable
dat1$tau <- dat1$y1 - dat1$y0
## Print the first 6 rows
dat1[1:6,c("y1","y0","Y","Z","tau")]
** Create an ATE variable
gen tau = y1 - y0
** List the first 6 rows
list y1 y0 y z tau in 1/6
y1 y0 Y Z tau
1 6.123 0.000 0.000 0 6.123
2 2.741 0.000 0.000 0 2.741
3 7.215 0.000 0.000 0 7.215
4 6.708 2.056 2.056 0 4.652
5 5.654 0.000 5.654 1 5.654
6 15.561 10.409 10.409 0 5.152
We can’t observe anyone under treatment and control simultaneously. Instead, we estimate the difference between the average outcome among the people that actually receive treatment, $\bar{Y}_{Z=1}$, and the average among the people that don’t, $\bar{Y}_{Z=0}$. Assuming we have a well-designed study otherwise (chapters 3-5), this should let us estimate the average tau across people in our dataset with some degree of random error. This is the average treatment effect (ATE), $\frac{1}{N}\sum_{i=1}^{N}\tau_i$.
## Actual average treatment effect
print("Actual ATE")
mean(dat1$tau)
## Sample estimate of the ATE: manually
print("Sample estimate of ATE")
mean(dat1$Y[dat1$Z==1]) - mean(dat1$Y[dat1$Z==0])
## Sample estimate of the ATE: regression
#as.numeric(lm(Y ~ Z, data = dat1)$coefficients[2])
** Actual average treatment effect
di "Actual ATE"
qui sum tau, meanonly
di "`r(mean)'"
** Sample estimate of the ATE: manually
di "Sample estimate of ATE"
qui sum y if z==1, meanonly
local mean_z1 = r(mean)
qui sum y if z==0, meanonly
local mean_z0 = r(mean)
di `mean_z1'-`mean_z0'
** Sample estimate of the ATE: regression
*qui reg y z
*di _b[z]
[1] "Actual ATE"
[1] 5.453
[1] "Sample estimate of ATE"
[1] 4.637
In this example, while the true ATE is 5.4525, the average difference between treated and non-treated people is 4.6374. In a well-designed study we’d expect treated and non-treated people to still differ somewhat just by random chance, regardless of what the true individual-level treatment effects are. This means there’s likely at least a little random noise in our ATE estimate (in real data we can’t just take the mean of tau directly). The next two quantities help us think about what this potential random noise might mean for our confidence in our findings.
7.1.1.2 Why do we calculate it?
Often, we’d like to do more than just say “the policy works” or “the policy doesn’t work.” We’d like to provide some kind of measure of how well it works. This can be important for policy learning: we’d like to be able to say not just “the procedural reform made application processing costs cheaper,” but “the procedural reform cut processing costs by $0.50 per application, on average.” An average treatment effect is the most common method of quantifying how well policy programs work in fields like political science and economics. It’s usually the method we apply.
The ATE is the dominant way of quantifying effects for a “typical person.” But there are other ways of understanding what a “typical person” is. For instance, what if our outcome measure is very skewed, so that we’re less interested in the average effect and more interested in the median effect? In that case we might want to estimate something like a quantile regression that targets percentile effects directly, as in the sketch below.
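For illustration, here is a minimal sketch in R of how a median treatment effect could be estimated in our example data. It assumes the quantreg package is installed; the package choice and the bootstrap standard errors are one reasonable option, not a prescribed OES approach.
## A minimal sketch: estimate the effect of Z on the
## MEDIAN of Y (rather than the mean) with quantile
## regression. Assumes the quantreg package is installed.
library(quantreg)
## tau = 0.5 targets the 50th percentile (unrelated to
## the tau column in dat1).
qmod <- rq(Y ~ Z, tau = 0.5, data = dat1)
## Bootstrapped standard errors for the median effect.
summary(qmod, se = "boot")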
Also, sometimes we actually might not be interested in a typical effect. Maybe we really do just want to know whether there was any kind of change in the distribution of outcomes across the treatment/control groups. This is one strategy we could follow if we’re evaluating whether a new policy has any negative consequences that might be masked by only evaluating average differences. In that case, we might be interested in tests that sacrifice providing a single number summary in return for helping us detect other types of changes in the outcome distribution (Bowers, Fredrickson, and Panagopoulos 2013; Lin et al. 2017).
Here are a few examples of tests we might use in that last situation:
## E.g., a Kolmogorov-Smirnov test for a
## difference in outcome distributions.
print("KS test for a difference in distributions")
ks.test(Y ~ Z, data = dat1)
## E.g., a rank-based test for a "location shift"
## in the distribution of Y.
print("MWU rank-based test of a shift in distributions")
wilcox.test(Y ~ Z, data = dat1)
** E.g., a Kolmogorov-Smirnov test for a
** difference in outcome distributions.
di "KS test for a difference in distributions"
ksmirnov y, by(z)
** E.g., a rank-based test for a "location shift"
** in the distribution of Y.
di "MWU rank-based test of a shift in distributions"
ranksum y, by(z)
[1] "KS test for a difference in distributions"
Exact two-sample Kolmogorov-Smirnov test
data: Y by Z
D = 0.6, p-value = 5e-07
alternative hypothesis: two-sided
[1] "MWU rank-based test of a shift in distributions"
Wilcoxon rank sum test with continuity correction
data: Y by Z
W = 360, p-value = 1e-06
alternative hypothesis: true location shift is not equal to 0
7.1.1.3 How do we report it?
We’ve found it easiest to communicate ATE estimates by starting with a baseline (“status quo”) estimate of the average value of the outcome, providing an initial reference level. We then report how much this outcome increases, on average, when the intervention is administered. Contrasting the typical baseline outcome with how much change the intervention introduces often helps make these numbers more concrete. It also helps with interpreting how “large” our ATE is.
In our data example, we estimate an average or “typical” status quo outcome of 2.132. We then estimate that the intervention increases this by 4.637 on average. That change represents a more than 200% increase over the baseline: the estimated effect is more than twice the size of the typical status quo outcome.
You can see a method of computing those estimates in the following example code:
## Baseline outcome rate.
## Assuming we might have control
## variables, let's compute this manually
## instead of using the regression intercept.
print("Average baseline outcome")
mean(dat1$Y[dat1$Z==0])
## Sample ATE estimate
mod <- lm(Y ~ Z, data = dat1)
print("Sample ATE estimate")
as.numeric(mod$coefficients["Z"])
## Percent change in baseline due to treatment
print("Percent change")
as.numeric(mod$coefficients["Z"]/mean(dat1$Y[dat1$Z==0]) * 100)
** Baseline outcome rate.
** Assuming we might have control
** variables, let's compute this manually
** instead of using the regression intercept.
di "Average baseline outcome"
qui sum y if z == 0, meanonly
local mean_z0 = r(mean)
di "`mean_z0'"
** Sample ATE estimate
di "Sample ATE estimate"
qui reg y z
di _b[z]
local ate = _b[z]
** Percent change in baseline due to treatment
local p_chng = `ate'/`mean_z0'*100
di "Percent change = `p_chng'"
[1] "Average baseline outcome"
[1] 2.132
[1] "Sample ATE estimate"
[1] 4.637
[1] "Percent change"
[1] 217.5
7.1.2 p-values
7.1.2.1 What is it?
The p-value is one of the primary ways applied researchers think about how random noise in their ATE estimates might undermine their confidence in their findings. There is a lot of work on the details of correctly interpreting p-values and understanding how they are calculated (Greenland 2019; Lakens 2021). We won’t try to replicate a statistics textbook. But we will quickly review the basic definition, since it’s important for making sense of the sections below on interpreting null results or tests for “toxicity.”
We get one real ATE estimate from our sample. But there is some random noise in this estimate, and so our estimate is just one draw from a larger distribution of estimates that we could have possibly seen. Typically, researchers imagine that this random noise comes from sampling units to study from a broader population, or even from treating the laws of nature themselves as the underlying “super-population” we want to learn about (in either case, that larger distribution of estimates we could have seen would be called the “sampling distribution”). Or instead, as discussed in chapter 3, we can think of random assignment as the source of random noise (that larger distribution is then the “randomization distribution”).
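For reference, here is a minimal sketch in R of what a randomization-based p-value could look like in our example data: we re-randomize Z many times under the sharp null of no effect for any unit and see how often the re-randomized difference in means is at least as large as the one we observed. The number of re-randomizations and the seed are arbitrary choices for illustration.
## A minimal sketch: a two-sided randomization-based
## p-value for the sharp null of no effect for any unit.
set.seed(2024)
obs_diff <- mean(dat1$Y[dat1$Z==1]) - mean(dat1$Y[dat1$Z==0])
null_diffs <- replicate(5000, {
  Zstar <- sample(dat1$Z) ## permute the treatment labels
  mean(dat1$Y[Zstar==1]) - mean(dat1$Y[Zstar==0])
})
## Share of re-randomizations at least as far from 0
## as the observed difference in means.
mean(abs(null_diffs) >= abs(obs_diff))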
Let’s focus on traditional sampling-based inference. Assume we’re testing a null hypothesis that the true average treatment effect is 0 (this is the default in most software commonly used in policy research). A two-sided (or two-tailed) p-value quantifies the percent of values in the sampling distribution that are as far from 0 as our ATE estimate (or further), assuming the null hypothesis is true. In other words, under a true null of no average treatment effect, what is the probability that random noise alone would produce evidence of an average treatment effect at least as strong as ours?42
The p-value doesn’t tell us the probability that our finding in particular is a result of random noise. It only informs us about how often random noise alone, under a true null, could produce evidence like ours. See here for more discussion of some common misconceptions about p-values. It is very easy to accidentally explain this incorrectly!
Below is an illustration of computing p-values, manually or using programs in R/Stata that do it for us:
## Fit a regression model
mod <- lm(Y ~ Z, data = dat1)
df <- mod$df.residual
## Get estimates with HC2 errors instead of default SEs.
ct <- coeftest(mod, vcov. = vcovHC(mod, "HC2"))
## Compute t-stat from coef and SE
t <- ct[2,1]/ct[2,2]
## Compare to its null (of 0) sampling distribution.
## Compute two-sided p-value.
p <- 2 * pt(-1 * abs(t), df, lower.tail = TRUE)
## Same result we get from the model!
stopifnot(identical(ct[2,4], p))
# View
p
** Fit a regression model
qui reg y z
local df = e(df_r)
** Get estimates with HC2 errors instead of default SEs.
qui reg y z, vce(hc2)
** Compute t-stat from coef and SE
local t = _b[z]/_se[z]
** Compare to its null (of 0) sampling distribution.
** Compute two-sided p-value.
local p = (2 * ttail(`df', abs(`t')))
* View
di "`p'"
[1] 2.147e-05
7.1.2.2 Why do we calculate it?
We want to get a sense of whether the potential for random noise in our ATE estimates should make us doubt the accuracy of our findings. Researchers commonly accomplish this by comparing their p-values to a significance threshold, traditionally 0.05. If a p-value is less than that heuristic value, the finding is classified as “statistically significant.” Ultimately, we calculate p-values not because it is common to interpret them directly, but because this comparison is a rule of thumb researchers follow to decide which findings should hesitantly be treated as the truth.
We report results within this framework, sometimes called “null hypothesis significance testing” (NHST), because it is common in the academic fields most of us come from—it is important for policy learning for academic researchers and program evaluators to interpret statistical evidence using similar procedures. That said, we are also sympathetic to concerns with significance testing raised in some of the sources cited throughout this chapter. When appropriate, we are open to other methods of evaluating the plausibility of our findings given concerns about random noise in data.
7.1.2.3 How do we report it?
We generally report the p-values themselves, whether they are one-tailed or two-tailed, how they were calculated (if using an alternative inference procedure like randomization inference), and whether they indicate that our finding is statistically significant. Beyond this, we often do not interpret them further.
That said, p-values can be thought of as a measure of the relative “compatibility” of our data and analysis with the null hypothesis of no average treatment effect (Amrhein and Greenland 2022), or as measures of evidence against (or divergence from) the null hypothesis (Greenland 2023). A p-value of 0.9 means we estimate a 90% chance that random noise alone could produce evidence of an average treatment effect in our sample at least as strong as our effect estimate under a true null. This means our data and analysis are highly compatible with the null hypothesis, and could easily have appeared under a true null. There could sometimes be cases where it’s useful for OES projects to interpret p-values directly in this way.
7.1.3 Confidence intervals
7.1.3.1 What is it?
See the p-value section first. Like p-values, confidence intervals are easy to accidentally explain wrong.
Let’s start with the traditional definition. When we calculate a 95% confidence interval for our ATE estimate, we are saying that if we calculated a similar confidence interval around each estimate in the sampling distribution, 95% of them would contain the true average treatment effect (the true average of tau across units, as in our example above). This does not mean that our confidence interval specifically has a 95% chance of containing the truth! It just means that a calculation procedure like this contains the truth 95% of the time (it is a property of the procedure and not the result). Among other things, confidence intervals summarize what we learn from looking at p-values: if a 95% confidence interval crosses 0 (includes positive and negative values), a finding is not statistically significant under the common 0.05 threshold.
That technically correct definition is often hard to think about, and rarely useful for communicating our results! A more useful alternative definition is that the confidence interval represents the range of null hypotheses we cannot reject with 95% confidence. For example, a confidence interval of -0.05 to 0.05 indicates that, if we tested a null hypothesis that the average treatment effect is 0.3 (instead of a null of 0), we would see a p-value less than 0.05: our evidence against this null is sufficiently strong that we can reject it with 95% confidence. In contrast, we would not be able to provide sufficient evidence to reject a null hypothesis of -0.025 with 95% confidence. Again, this phrase “with 95% confidence” is a property of the procedure used to produce the result, and not the result itself.
Here’s an example of computing a confidence interval, both manually and using programs that do it automatically.
## Get CI based on HC2 SEs
ci <- coefci(mod, vcov. = vcovHC(mod, "HC2"))
## Compute manually for comparison
ci2 <- c(
ct[2,1] - (ct[2,2]*qt(0.975, df)),
ct[2,1] + (ct[2,2]*qt(0.975, df))
)
## Same result we get from the model!
stopifnot(identical(as.numeric(ci[2,]), ci2))
# View
ci2
** Get CI based on HC2 SEs
qui reg y z, vce(hc2)
local df = `e(df_r)'
local ci_low = r(table)["ll","z"]
local ci_up = r(table)["ul","z"]
** Compute manually for comparison
local crit_val = invttail(`df', 0.025)
local ci_low2 = _b[z] - (_se[z]*`crit_val')
local ci_up2 = _b[z] + (_se[z]*`crit_val')
** View
di "`ci_low2' to `ci_up2'"
[1] 2.576 6.699
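To illustrate the “range of null hypotheses we cannot reject” interpretation described above, here is a minimal sketch in R that reuses ct, ci2, and df from the chunks above. A null hypothesis placed at the upper bound of the interval yields a p-value of roughly 0.05, while a null placed inside the interval (the value 4 is an arbitrary choice for illustration) yields a larger p-value.
## A minimal sketch: nulls at the CI boundary produce
## p-values of about 0.05; nulls inside the CI produce
## larger p-values. Reuses ct, ci2, and df from above.
t_bound <- (ct[2,1] - ci2[2]) / ct[2,2]
2 * pt(-abs(t_bound), df) ## approximately 0.05
t_inside <- (ct[2,1] - 4) / ct[2,2] ## a null of 4, inside the CI
2 * pt(-abs(t_inside), df) ## well above 0.05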
7.1.3.2 Why do we calculate it?
Given concerns about random noise in our ATE estimate, we want an informal best guess for the range of values in which the truth might fall: “There may be some random noise in our estimate, but applying a method that contains the truth a vast majority of the time, our best guess is that the real effect falls in this range.” As with p-values, we calculate this because it is a common way for researchers in applied fields to evaluate statistical evidence. This is an area where it’s especially important to keep the technical definition and the motivation separate. This way of thinking about why we calculate confidence intervals often leads people to incorrectly treat them as having a 95% chance of containing the truth.
We might also calculate a 95% confidence interval for its second interpretation: because we want to know the range of null hypotheses we can or can’t reject with 95% confidence. If a value is outside this interval, it means that we are sufficiently confident (under standard heuristics for evaluating statistical evidence) that we can “rule it out” as a likely candidate for the truth.43 On the other hand, if a value is in this confidence interval, we should treat it as a likely candidate for the true average treatment effect: we cannot “rule it out” under standard heuristics for evaluating evidence.
This second reason for calculating confidence intervals is useful for making sense of statistically significant results as well. Let’s say our outcome measure is application acceptance, and we estimate that an informational intervention increases the probability that a person’s application is accepted by 0.2, on average (+20pp). The 95% confidence interval is [0.001, 0.399]. This result is statistically significant, but the interval indicates that we should consider null hypotheses ranging from a 0.1pp increase to a nearly 40pp increase as plausible candidates for the truth. This is a wide range for a behavioral study! We can confidently provide evidence against treatment effects as large as, say, 60pp, and we can confidently provide evidence that the treatment effect is positive (not negative or null). But we cannot confidently say whether the true effect is negligible (e.g., 0.1pp) or substantial (e.g., 30pp).
7.1.3.3 How do we report it?
As with p-values, when we are concerned about avoiding technical detail, we generally report confidence intervals without much more interpretation. However, interpretation exercises like our example above can often be useful for helping us decide how we frame our ATE estimates. At some point, the confidence interval around a statistically significant estimate may be so wide that, rather than say the estimated effect is large or small, we can only really confidently say that the effect is positive (and it would be misleading to imply otherwise). Additionally, we do sometimes provide more interpretation of the values in a confidence interval, directly or indirectly, when evaluating null results. See the next section.
7.2 Making sense of statistically insignificant results
Statistical insignificance is not necessarily sufficient to support a claim of “no treatment effect.” This is something many applied fields have come to appreciate better in recent years. Statistical insignificance MIGHT indicate a lack of a treatment effect. But it might also indicate that our study simply has too little statistical power to detect a meaningful, real effect we would care about—our minimum detectable effect (MDE) at 80% power is greater than the smallest effect of substantive interest (SESI). See the last section of the power chapter for a little more on those terms.
Thinking through this requires making judgements about the effect sizes that would be meaningful for our partners, or that are too small to be meaningful/actionable. Due to random noise in our ATE estimates, we will never estimate an effect of exactly 0, even when there is no treatment effect. An inherent part of evaluating statistical evidence in program evaluation is deciding whether a finding is actionable. When we see a statistically insignificant finding we need to determine whether it is more consistent with a “no meaningful effect” interpretation or an “inconclusive” interpretation. This shapes how we will frame findings for our partners.
It’s easy to think of an “inconclusive” result as a failure. But even these kinds of findings may still tell us something useful: currently available data do not let us determine whether an intervention is effective. This may imply a need for more research and data collection, or for policy implementation decisions to proceed cautiously, perhaps on some other grounds. As long as an evaluation is conducted carefully and transparently, we always learn something useful.
Below, we walk through two ways of determining which conclusion is more consistent with statistically insignificant evidence from our evaluation. Either approach is fine to use in practice. They solve the same problem. The choice comes down to what makes more sense to you (or which you think a partner might prefer).
7.2.1 Statistical power
We can approximate the MDE-80 from our statistical results using a trick outlined at the end of the last chapter—what is the smallest effect, roughly, that we would have been able to detect at 80% power ($\text{MDE}_{80} \approx 2.8 \times \widehat{SE}$)? We can compare this to what we think the SESI is based on discussions with partners. If MDE-80 > SESI, then we likely cannot support a claim of “no meaningful effect” based on our results: there are interesting effect sizes that are simply smaller than those we can detect with sufficient power, and we cannot rule them out.44 The best we can do is use confidence intervals to determine what range of possible true effects we can rule out with 95% confidence, which helps narrow the remaining possibilities (this could still be useful for our partners).
Things are trickier when MDE-80 < SESI. Consider an evaluation of a binary outcome representing application acceptance, which we hoped to increase with an informational intervention. Our confidence interval ranges from -0.005 to 0.015. This result is statistically insignificant. But can we say there’s no meaningful effect?
Assume we estimate an approximate MDE-80 of 0.005, a 0.5pp increase in acceptance. If our partner indicates that the SESI is 1pp, this means that we were sufficiently powered to detect effects smaller than those that our partner considers meaningful. At first glance, we might decide this supports a true null. On the other hand, our confidence interval suggests that we shouldn’t rule out values as large as +1.5pp as possible candidates for the truth, which includes substantively meaningful values (greater than the SESI). We might then conclude that while our evaluation provides evidence against increases greater than 1.5pp, and also suggests a null effect is most likely, we cannot entirely rule out small (close to the SESI) increases in application acceptance.
We suggest the following decision procedure when faced with a statistically insignificant result:
Get an estimate of MDE-80, approximated either by a preliminary power analysis or a post-hoc calculation using the standard error ($\text{MDE}_{80} \approx 2.8 \times \widehat{SE}$)
Make a judgement call about the SESI, based either on discussions with the partner or our own thinking (e.g., based on findings of other studies)
If MDE-80 > SESI, we can’t treat our evaluation as supporting a claim of no effect at all. But looking at the 95% confidence interval helps us understand the range of possibilities we can tentatively rule out.
If MDE-80 < SESI, we should lean towards a “no meaningful effect” interpretation, but we should still look at the 95% confidence interval to think this through.
Here’s a coded example where the MDE-80 < SESI, which helps us justify a “no meaningful effect” interpretation. But we still want to check the confidence interval; in this case, it also supports a no-meaningful-effect interpretation.
## Generate a treatment variable that is
## not associated with the outcome in any way.
set.seed(1234)
dat1$nullZ <- sample(c(rep(1,50),rep(0,50)), nrow(dat1), replace = FALSE)
## Regress the outcome on this null treatment indicator.
mod <- lm(Y ~ nullZ, data = dat1)
## Get statistical results based on HC2 errors.
ct <- coeftest(mod, vcov. = vcovHC(mod, "HC2"))
ci <- coefci(mod, vcov. = vcovHC(mod, "HC2"), level = 0.95)
## Get rough approximation of ex-post MDE.
## If, say, the SESI was 3, then this is
## greater than the MDE, which is good.
## But we still want to check the confidence interval.
print("E.g.: SESI = 3")
print(paste0("MDE80 = ", round(ct["nullZ","Std. Error"]*2.8,3) ))
print(paste0("CI = ",paste0(round(ci["nullZ",],3),collapse = " to ")))
** Generate a treatment variable that is
** not associated with the outcome in any way.
set seed 1234
gen u = runiform()
gsort u
gen nullZ = 1 in 1/50
replace nullZ = 0 if missing(nullZ)
** Regress the outcome on this null treatment indicator.
qui reg y nullZ, vce(hc2)
local ci_low = r(table)["ll","nullZ"]
local ci_up = r(table)["ul","nullZ"]
** Get rough approximation of ex-post MDE.
** If, say, the SESI was 3, then this is
** greater than the MDE, which is good.
** But we still want to check the confidence interval.
local mde = _se[nullZ]*2.8
di "E.g.: SESI = 3"
di "MDE80 = `mde'"
di "CI = `ci_low' to `ci_up'"
[1] "E.g.: SESI = 3"
[1] "MDE80 = 2.443"
[1] "CI = -1.343 to 2.12"
7.2.2 Equivalence tests
An alternative way of approaching the same problem is to use a procedure called equivalence testing (Hartman and Hidalgo 2018; Rainey 2014). The idea is to perform a formal test of the claim that our estimated treatment effect is so small that it is effectively 0. As usual in Frequentist statistics, we argue for this claim by evaluating the evidence against its opposite: that there is a meaningful treatment effect.
The logic here is easy to stumble over! When testing FOR an effect—what we normally do—we evaluate whether we see sufficient evidence against the null hypothesis of no treatment effect. But in equivalence testing our priorities are reversed: we want to determine whether we can confidently reject a null hypothesis of some meaningful effect. The underlying concern, as discussed above, is that we see a statistically insignificant result in our test FOR an effect due to insufficient power rather than because there is really no effect.
There are a few ways you can go about performing an equivalence test in practice. The most common procedure is called a two one-sided test (TOST). Hartman and Hidalgo (2018), for instance, suggest an alternative that might have better properties, but we’ll focus on the TOST procedure. It starts by choosing an equivalence region. This is a region of effect sizes that we think are so small as to be effectively 0 and not actionable for our agency partner (i.e., the SESI, on either side of 0). Again, this decision inherently requires subjective judgement calls and should usually be discussed with the partner directly. Once we have defined an equivalence region, we test the null hypothesis that the treatment effect is outside of this equivalence region—either above it (a meaningful positive treatment effect) or below it (a meaningful negative treatment effect).
As you might have guessed, we can do this by performing two one-sided tests. The first is a one-sided test of whether the treatment effect is greater than an assumed null equal to the lower bound of the equivalence region (evidence against the null of a meaningful negative effect). The second is a one-sided test of whether the treatment effect is less than an assumed null equal to the upper bound of the equivalence region (evidence against the null of a meaningful positive effect). The maximum of these is the overall p-value for the TOST procedure. If it is below 0.05, we can reject the null of a meaningful effect with 95% confidence. We show an example in the code chunk below.
We lay out what the TOST procedure is aiming to accomplish above for reference. However, in practice, there is a useful shortcut to avoid those separate p-value calculations: construct a 90% confidence interval for your treatment effect, and determine if this CI is entirely within the equivalence region. If so, a TOST would yield a p-value less than 0.05 (rejecting the null of a meaningful effect at a 95% confidence level). For more on this approach and some intuition for why it works, see Rainey (2014). We generally prefer this method of thinking through the results of an equivalence test for a few reasons: it requires minimal changes from quantities we would calculate anyway;45 and it makes the logic underlying our conclusions more transparent. If someone reviewing our materials has a different opinion about what the equivalence region should be, they can easily compare our CI to their own preferred equivalence region instead.
Here’s a coded example, based on our example above, where we perform a TOST procedure using both methods (formal calculation and the 90% CI shortcut) to show that they support the same conclusion.
## 90% CI
ci90 <- coefci(mod, vcov. = vcovHC(mod, "HC2"), level = 0.90)
## Define equivalence region so that
## it is right on the edge of CI upper bound.
eq <- c(-1.85, 1.85)
print(paste0("90% CI = ",paste0(round(ci90["nullZ",],3),collapse = " to ")))
print(paste0("EQ region = ",paste0(eq,collapse = " to ")))
## First TOST one-sided p-value
p1 <- (mod$coefficients["nullZ"] - eq[1])/ct["nullZ", "Std. Error"]
p1 <- pt(p1, mod$df.residual, lower.tail = FALSE)
## Second TOST one-sided p-value
p2 <- (mod$coefficients["nullZ"] - eq[2])/ct["nullZ", "Std. Error"]
p2 <- pt(p2, mod$df.residual, lower.tail = TRUE)
## TOST overall p-value (95% confidence).
## Borderline, since 90% CI is almost at
## edge of equivalence region!
print(paste0("TOST p-value = ", round(max(p1,p2),3)))
** 90% CI
qui reg y nullZ, vce(hc2) level(90)
local df = e(df_r)
local ci_low = r(table)["ll","nullZ"]
local ci_up = r(table)["ul","nullZ"]
** Define equivalence region so that
** it is right on the edge of CI upper bound.
local eq_low = -2.25
local eq_high = 2.25
di "90% CI = `ci_low' to `ci_up'"
di "EQ region = `eq_low' to `eq_high'"
** First TOST one-sided p-value
local p1 = (_b[nullZ] - `eq_low')/_se[nullZ]
local p1 = ttail(`df', `p1')
** Second TOST one-sided p-value
local p2 = (_b[nullZ] - `eq_high')/_se[nullZ]
local p2 = 1 - ttail(`df', `p2')
** TOST overall p-value (95% confidence).
** Borderline, since 90% CI is almost at
** edge of equivalence region!
local both_pvals `p1' `p2'
local max : subinstr local both_pvals " " ",", all
local max = max(`max')
di "TOST p-value = `max'"
[1] "90% CI = -1.06 to 1.837"
[1] "EQ region = -1.85 to 1.85"
[1] "TOST p-value = 0.049"
7.3 Efficacy vs toxicity
In the early phases of a clinical trial, in addition to testing a medication’s possible efficacy, medical researchers look for evidence of an unacceptable rate of “serious adverse events.” This is sometimes referred to as an evaluation of toxicity. Even an effective medication may have unintended side effects so serious that they prevent it from ever reaching the market (or at least restrict the patients to whom it can be prescribed).
We can apply a similar idea to program evaluations in the public policy sphere. Instead of evaluating whether a policy change has positive impacts (as we would normally do), our priority might sometimes be providing evidence against negative impacts. For instance, a partner might be considering a new procedure for processing program applications that provides significant cost savings. We might help them perform an evaluation to ensure that it does not lead to a meaningful decline in acceptance rates.
Evaluating toxicity raises statistical issues similar to those raised in the equivalence testing section. Our priority here is not evaluating the efficacy of the treatment (determining whether we can confidently reject a null hypothesis of no average effect). Instead, we simply want to make sure that the treatment does not cause an average decline in the outcome. In tests for toxicity, we determine whether we can reject the null hypothesis of a meaningful NEGATIVE treatment effect.
This corresponds to one of the two p-values we would calculate for a TOST procedure. To quote our text above: “a one-sided test of whether the treatment effect is greater than an assumed null equal to the lower bound of the equivalence region (evidence against the null of a meaningful negative effect)”. As long as the lower bound of the equivalence region is below zero, we generally cannot compute this p-value by dividing the two-sided p-value we see in our standard regression output by two.46
When evaluating toxicity, we recommend stating this goal explicitly in our analysis plans and abstracts. This test can then be performed by either computing the relevant one-sided p-value—see the calculation of p1 in our equivalence testing code chunk above—or by constructing a 90% confidence interval as discussed in the equivalence testing section and determining whether the lower bound of this interval is within the equivalence region. In either case, as with equivalence testing, we need to make a judgement call about the difference between a “meaningful” and “not meaningful” decline in the outcome, ideally informed by discussions with our agency partners. If we do not believe it is possible to draw this distinction, or if we believe a negative effect of any size is unacceptable, we might default to an equivalence region with a lower bound at 0.
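As an illustration, here is a minimal sketch in R of what such a toxicity check might look like, reusing mod, ct, and ci90 from the equivalence testing chunks above. The equivalence-region lower bound of -1.85 is just the illustrative value used earlier, not a recommended default; in practice it should be set with partners.
## A minimal sketch: a one-sided test against the null of
## a meaningful NEGATIVE effect, reusing mod, ct, and ci90
## from the equivalence testing chunks above. The -1.85
## lower bound is the illustrative value chosen earlier.
eq_low <- -1.85
## One-sided p-value against the null that the true effect
## is at or below the equivalence region's lower bound.
t1 <- (mod$coefficients["nullZ"] - eq_low)/ct["nullZ", "Std. Error"]
pt(t1, mod$df.residual, lower.tail = FALSE)
## Shortcut: is the lower bound of the 90% CI above the
## equivalence region's lower bound?
ci90["nullZ", 1] > eq_low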
References
42. Using randomization inference: the percent of values in the randomization distribution that are as far from 0 as our ATE estimate (or further), assuming the null hypothesis of no effect for any unit is true. The primary difference here is the null hypothesis changing.
43. This does not prove beyond a shadow of a doubt that these values cannot be the truth. But in the absence of other evidence, it suggests that our attention should be directed elsewhere.
44. We generally use the heuristic 80% threshold for “sufficient” power in our evaluations, but there are times when we deviate from this.
45. By default we already calculate 95% confidence intervals. This corresponds to a TOST at a stricter 97.5% confidence level. If the values in this CI are already all within the equivalence region, there is no need to compute a separate 90% interval.
46. A one-sided p-value from a standard test of treatment efficacy corresponds to a null hypothesis of an effect that is zero or less. But there may be negative effect sizes we consider to be “effectively zero” because they are in the equivalence region, and so using a one-sided p-value from a standard efficacy test to evaluate toxicity might be unnecessarily conservative.