Chapter 4 Randomization choices

After working together with our agency partners to translate insights from the social and behavioral sciences into potential policy recommendations, we assess those new ideas by observing differences or changes in real-world outcomes (usually measured using existing administrative data).7 In most cases, we design a randomized controlled trial (an RCT) to ensure that the differences or changes we observe are driven by the policy intervention itself. Here, we show examples of the different methods we consider for randomly assigning units to treatment. These form the core of the different types of RCTs that we use to build evidence about the effectiveness of the new policies.

4.1 Coin flipping vs urn-drawing randomization

Many discussions of RCTs begin by talking about the intervention being assigned to units (people, schools, offices, districts) “by the flip of a coin,” or simple random assignment. Each unit’s assignment to treatment occurs separately, and there is no ex ante guarantee as to exactly what the final number of treated or control units will be. We don’t always use this method in practice, even though it is a useful way to introduce the idea that RCTs guarantee fair access to a new policy.8

The following code contrasts coin-flipping style random assignment with drawing-from-an-urn style, or complete random assignment (where a fixed number of units are randomly chosen for treatment). Coin-flipping-based experiments are still valid and tell us about the underlying counterfactuals, but they might have less statistical power, so we try to avoid them where possible.

Notice that the simple random assignment implemented in the code below results in more observations in the treatment group (group T) than in the control group (group C). Complete random assignment, in contrast, will always assign 5 of the 10 units to treatment and 5 to control.


## Start with a small experiment with only 10 units
n <- 10

## Set a random seed for replicability
set.seed(12345)

## Function to add more intuitive labels
labelTC <- function(assignments) {ifelse(assignments == 1, "T", "C")}

## Simulation using functions from the randomizr package
library(randomizr)
trt_coinflip <- labelTC(simple_ra(n))
trt_urn <- labelTC(complete_ra(n))

## Coin flipping does not guarantee half and half treated and control.
## Drawing from an urn, guarantees half treated and control.
table(trt_coinflip)
table(trt_urn)

## Alternative approach using base R
# set.seed(12345)
# trt_coinflip <- labelTC(rbinom(10, size = 1, prob = .5))
# trt_urn <- labelTC(sample(rep(c(1, 0), n / 2)))
# table(trt_coinflip)
# table(trt_urn)


** Start with a small experiment with only 10 units
clear
global n = 10
set obs $n

** Set a random seed for replicability
set seed 12345

** Simulation using functions from the randomizr package
* ssc install randomizr

simple_ra trt_coinflip
* Or, e.g.: gen trt_coinflip = rbinomial(1, 0.5)

complete_ra trt_urn
/* 
* Or, e.g.:
local num_treat = $n/2
gen rand = runiform()
sort rand
gen trt_urn = 1 in 1/`num_treat'
replace trt_urn = 0 if missing(trt_urn)
drop rand
*/

** Add more informative labels
label define tc 0 "C" 1 "T"
label values trt_coinflip tc
label values trt_urn tc

** Coin flipping does not guarantee half and half treated and control.
** Drawing from an urn, guarantees half treated and control.
table trt_coinflip
table trt_urn
trt_coinflip
C T 
3 7 
trt_urn
C T 
5 5 

4.2 Randomization into 2 or more groups

Instead of manually coding assignment ourselves using the base R sample function (or an equivalent procedure in Stata), we can use the randomizr package (Coppock 2023) (available for both R and Stata) for many kinds of simpler randomization designs. An advantage of using randomizr for the examples in this SOP is that it automatically does some quality control checks, which saves us the trouble of writing out such checks ourselves for all of our coded examples.

For instance, notice that we implement a check on our code below with the stopifnot command: the code will stop with an error if we didn’t actually assign 1/4 of the observations to the treatment condition. Here, we assign the units first to 2 arms with equal probability (Z2armEqual). Then, to show how the code works, we assign them to 2 arms where one arm has only a 1/4 probability of receiving treatment (e.g., imagine a design with an expensive intervention). Last, we assign them based on a design with 4 different arms, each with equal probability (e.g., one control group and three different treatments under consideration). We’ll generally use Z in this document to refer to the variable recording our intervention arms.


N <- nrow(dat1)
set.seed(12345)

## Two equal arms
dat1$Z2armEqual <- labelTC(complete_ra(N))

## Two unequal arms: .25 chance of treatment (.75 chance of control)
dat1$Z2armUnequalA <- labelTC(complete_ra(N,prob=.25))
stopifnot(sum(dat1$Z2armUnequalA=="T")==N/4)
dat1$Z2armUnequalB <- labelTC(complete_ra(N,m=N/4))

## Four equal arms
dat1$Z4arms <- complete_ra(N, m_each=rep(N/4,4))

table(Z2armEqual=dat1$Z2armEqual)
table(Z2armUnequalA=dat1$Z2armUnequalA)
table(Z2armUnequalB=dat1$Z2armUnequalB)
table(Z4arms=dat1$Z4arms)


** Return to data for the fake experiment.
import delimited using "dat1.csv", clear

qui count
global N = r(N)
set seed 12345

** Two equal arms
complete_ra z2armEqual
label define tc 0 "C" 1 "T"
label values z2armEqual tc

** Two unequal arms: .25 chance of treatment (.75 chance of control)
complete_ra z2armUnequalA, prob(0.25)
label values z2armUnequalA tc
qui sum z2armUnequalA
global expected = $N/4
assert r(sum) == $expected
complete_ra z2armUnequalB, m($expected)
label values z2armUnequalB tc

** Four equal arms
local count_list : di _dup(4) "$expected " // List of sample sizes for each group
macro list _count_list
complete_ra z4arms, m_each(`count_list')

table z2armEqual
table z2armUnequalA
table z2armUnequalB
table z4arms
Z2armEqual
 C  T 
50 50 
Z2armUnequalA
 C  T 
75 25 
Z2armUnequalB
 C  T 
75 25 
Z4arms
T1 T2 T3 T4 
25 25 25 25 

4.3 Factorial designs

It’s possible to test the effects of more than one intervention while losing less statistical power by randomly assigning multiple treatments independently of each other. The simplest design that we use for this purpose is the 2 × 2 factorial design. For example, in the next table we see that we have assigned 50 observations to each arm of two separate interventions. Since the randomization of treatment1 is independent of treatment2, we can assess the effects of each treatment separately and pay less of a power penalty (unless one of the treatments dramatically increases the variance of the outcome compared to a hypothetical experiment with only one treatment assigned).


## Two equal arms, adding a second cross treatment
dat1$Z2armEqual2 <- labelTC(complete_ra(N))
table(treatment1=dat1$Z2armEqual,treatment2=dat1$Z2armEqual2)


** Two equal arms, adding a second cross treatment
complete_ra z2armEqual2
label var z2armEqual2 "Treatment 2"
label var z2armEqual "Treatment 1"
label val z2armEqual2 tc
table z2armEqual z2armEqual2
          treatment2
treatment1  C  T
         C 23 27
         T 27 23

Although factorial designs allow us to test more than one intervention at the same time, they may not provide the same degree of power when testing hypotheses about the interaction between the two treatments. If we want to learn about how two different interventions work together, then the sample size requirements will be much larger than if we were satisfied with learning about each treatment separately.9

Importantly, recent work highlights two concerns regarding (1) the consequences of omitting interaction terms when estimating separate effects of each treatment and (2) the interpretation of factorial treatment effects (Muralidharan, Romero, and Wüthrich 2023). First, on (1), even if the interaction between treatments is not of academic or policy relevance, including it in the estimation model may be important for making correct inferences. Specifically, if the true interaction effect is not zero, excluding it from the model could increase the risk of Type I errors (i.e., false positives).

Meanwhile, on (2), consider a 2 × 2 factorial design with two treatments, A and B, and 25% of the sample in each of the four resulting cells. An estimated effect of Treatment A from a model without an interaction should be interpreted as a weighted average of the effects of A across two subsamples: those receiving B (50%) and those not receiving B (50%). This weighted average treatment effect may or may not provide useful information about the likely effects of Treatment A if it is scaled up later.10 For instance, there may be a substantial interaction with Treatment B, which is rarely administered in reality and which will not be scaled up alongside A. The subgroup effect of A among those not receiving B is then more policy relevant, but the subgroup effect of A among those receiving B pulls the overall estimated average effect of A away from it.

To deal with those issues, the OES Methods Team recommends estimating treatment effects in factorial experiments using a model that includes an interaction, at least as a robustness check.
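
As a minimal illustration of this recommendation, the sketch below compares a main-effects-only specification with a fully interacted one, using the two crossed assignments created above. It assumes a hypothetical outcome variable y in dat1 (not part of the example data) and uses HC2 standard errors via the lmtest and sandwich packages.

## Sketch: main-effects-only vs. fully interacted factorial models.
## Assumes a hypothetical outcome y in dat1; uses lmtest + sandwich for HC2 SEs.
library(lmtest)
library(sandwich)

fit_main <- lm(y ~ Z2armEqual + Z2armEqual2, data = dat1)  # omits the interaction
fit_int  <- lm(y ~ Z2armEqual * Z2armEqual2, data = dat1)  # includes the interaction

## Report both; large differences suggest the interaction matters for inference
coeftest(fit_main, vcov = vcovHC(fit_main, type = "HC2"))
coeftest(fit_int, vcov = vcovHC(fit_int, type = "HC2"))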

4.4 Block random assignment

4.4.1 The benefits of blocking

Statistical power depends not only on the size of the experiment and the strength of the treatment effect, but also on the amount of “noise” (non-treatment-related variability) in the outcome measure. Block-randomized designs can help reduce this noise while simultaneously minimizing estimation error—the amount that our particular experiment’s estimate differs from the truth.

In a block-randomized, or stratified, design, we randomly assign units to the policy intervention within pre-specified groups.11 Suppose we are evaluating whether dedicated navigators can increase the percentage of students living in public housing who complete federal financial aid applications (FAFSA). Our experiment will send navigators to two of four eligible buildings, two of which are large and two of which are small. In a real study we can never know the outcome in all buildings both with and without navigators (the “fundamental problem of causal inference” from the last chapter). But if we could, we might have the data below:

Building Size % FAFSA (No Navigator) % FAFSA (With Navigator)
1 Large 20 60
2 Large 30 70
3 Small 20 30
4 Small 30 40
Mean 25 50

The true average treatment effect for this sample is the average under treatment (i.e., the average treated potential outcome) minus the average under control (i.e., the average control potential outcome): ATE = 50 - 25 = 25 percentage points more applications per building when a navigator is deployed.

In a real study, we might randomly allocate two buildings to treatment and two buildings to control. If complete random assignment led to us treating the first two buildings, then we might observe:

Building Size Treated % FAFSA (No Navigator) % FAFSA (With Navigator)
1 Large 1 60
2 Large 1 70
3 Small 0 20
4 Small 0 30
Mean 25 65

This yields a treatment effect estimate of 65 - 25 = 40 percentage points more applications due to the presence of a navigator. This is larger than the true value of 25.

Or, if random assignment led to the other two buildings being treated instead, we might then observe:

Building Size Treated % FAFSA (No Navigator) % FAFSA (With Navigator)
1 Large 0 20
2 Large 0 30
3 Small 1 30
4 Small 1 40
Mean 25 35

This, in contrast, yields an estimated treatment effect of 35 - 25 = 10 percentage points more applications due to the navigators – now smaller than the true value of 25.

All of the possible (equiprobable) assignments with two treated and two control units, along with their estimated treatment effects, are listed in the table below:

Assignments Estimated Effect
TTCC 40
CTCT 35
TCCT 25
CTTC 25
TCTC 15
CCTT 10

These possible treatment effect estimates have a mean equal to the true value of 25, illustrating that the difference in means is an unbiased estimator. However, some of these estimates are far from the truth, and they vary considerably from one assignment to the next.
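
The short sketch below reproduces this enumeration in R, using the hypothetical potential outcomes from the table above (the variable names are ours, for illustration only).

## Sketch: enumerate all complete random assignments of 2 treated buildings
## out of 4, using the hypothetical potential outcomes from the table above.
y0 <- c(20, 30, 20, 30)  # % FAFSA without a navigator
y1 <- c(60, 70, 30, 40)  # % FAFSA with a navigator
treated_sets <- combn(4, 2)  # each column lists the two treated buildings

est <- apply(treated_sets, 2, function(trt) mean(y1[trt]) - mean(y0[-trt]))

mean(y1) - mean(y0)  # true ATE: 25
mean(est)            # mean over all assignments: also 25 (unbiased)
sd(est)              # about 11.4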

To design an experiment that better estimates the true value, and does so with more statistical power (less variability), we can randomly assign units within blocks. In general, units should be sorted into different blocks based on their similarity across one or more characteristics that we expect to be correlated with our outcome. Here, blocking implies restricting the possible random assignments to those that have one large and one small building in each treatment group:

Assignments Estimated Effect
CTCT 35
TCCT 25
CTTC 25
TCTC 15

With this blocked design restricting the random assignments that are possible, we now get an estimate that is no more than 10 percentage points from the truth. Further, our estimates will have less variability (an SD of 8.16 rather than 11.4). This improves the statistical power of our design.
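
Continuing the sketch above, we can restrict attention to the assignments that treat one large and one small building and confirm the reduction in variability.

## Continuing the sketch above: keep only assignments that treat one large
## and one small building (i.e., blocked on building size).
size <- c("Large", "Large", "Small", "Small")
blocked <- apply(treated_sets, 2, function(trt) length(unique(size[trt])) == 2)

mean(est[blocked])  # still 25 (unbiased)
sd(est[blocked])    # about 8.16, down from about 11.4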

For a more realistic example, suppose we are designing an experiment where the sample includes patients from two different hospitals. We might randomly assign patients to treatment and control within each hospital. For instance, we might assign half of the patients in hospital “A” to treatment and half to control, then do the same in hospital “B.”12 Below, we have 50 units in hospital “A” and 50 in hospital “B”:


dat1$blockID <- gl(n = 2, k = N/2, labels = c("Block A", "Block B"))
with(dat1,table(blockID=dat1$blockID))


local Anum = $N/2
gen blockID = .
tempvar rand
gen `rand' = runiform()
sort `rand'
replace blockID = 1 in 1/`Anum'
replace blockID = 2 if missing(blockID)
label define blocklab 1 "Block A" 2 "Block B"
label values blockID blocklab
table blockID
blockID
Block A Block B 
     50      50 

We assign half of the units in each hospital to each treatment condition:


dat1$Z2armBlocked <- labelTC(block_ra(blocks=dat1$blockID))
with(dat1, table(blockID, Z2armBlocked))


block_ra z2armBlocked, block_var(blockID) replace
label val z2armBlocked tc
table blockID z2armBlocked
capture drop __00* // Clean up block_var temporary var
         Z2armBlocked
blockID    C  T
  Block A 25 25
  Block B 25 25

If, say, there were fewer people eligible for treatment in hospital “A” — or perhaps the intervention was more expensive in that block — we might choose different treatment probabilities for each block. The code below assigns half of the hospital “A” patients to treatment, but only a quarter of those from hospital “B”. Again, we also check that this code worked. This approach is an informal version of one of the best practices for writing code in general, called “unit testing.” See the EGAP Guide to Workflow for more examples.


## Blocked assignment, unequal probability
dat1$Z2armBlockedUneqProb <- labelTC(block_ra(blocks=dat1$blockID, block_prob=c(.5,.25)))
with(dat1, table(blockID, Z2armBlockedUneqProb))

## Unit testing
NumTreatedB <- sum(dat1$Z2armBlockedUneqProb=="T" & dat1$blockID=="Block B")
ExpectedNumTreatedB <- sum(dat1$blockID=="Block B")/4
stopifnot(NumTreatedB==ceiling(ExpectedNumTreatedB) | NumTreatedB==floor(ExpectedNumTreatedB))


** Blocked assignment, unequal probability
block_ra z2armBlockedUneqProb, block_var(blockID) block_prob(0.5 0.25) replace
label val z2armBlockedUneqProb tc
table blockID z2armBlockedUneqProb
capture drop __00* // Clean up block_var temporary var

** Unit Testing
qui count if z2armBlockedUneqProb == 1 & blockID == 2
global NumTreatedB = `r(N)'
qui count if blockID == 2
global ExpectedNumTreatedB = `r(N)'/4
assert ($NumTreatedB == ceil($ExpectedNumTreatedB)) | ($NumTreatedB == floor($ExpectedNumTreatedB))
         Z2armBlockedUneqProb
blockID    C  T
  Block A 25 25
  Block B 38 12

Our team tries to implement block-randomized assignment whenever possible in order to increase the statistical power of our experiments. We also often find it useful in cases where different administrative units are implementing the treatment, or when we expect different groups of people to have different reactions to the treatment.

4.4.2 Using a few covariates to create blocks

If we have background information on a few covariates, we can create blocks by hand through a process like the one demonstrated here:


## For example, make three groups from the cov2 variable
dat1$cov2cat <- with(dat1, cut(cov2, breaks=3))
table(dat1$cov2cat, exclude=c())
with(dat1, tapply(cov2, cov2cat, summary))

## And we can make blocks that are the same on two covariates
dat1$cov1bin <- as.numeric(dat1$cov1>median(dat1$cov1)) # Binarize cov1
dat1$blockV2 <- droplevels(with(dat1, interaction(cov1bin, cov2cat)))
table(dat1$blockV2, exclude=c())

## And then assign within these blocks
set.seed(12345)
dat1$ZblockV2 <- labelTC(block_ra(blocks = dat1$blockV2))
with(dat1, table(blockV2, ZblockV2, exclude=c()))


** For example, make three groups from the cov2 variable
egen cov2cat = cut(cov2), group(3) label
table cov2cat
bysort cov2cat : sum cov2 // Divides into intervals differently from R

** And we can make blocks that are the same on two covariates
qui sum cov1, d
gen cov1bin = cond(cov1 > r(p50), 1, 0) // Similar to ifelse() in R
decode cov2cat, generate(string_cov2cat)
gen blockV2 = string(cov1bin) + " " + string_cov2cat
table blockV2

** And then assign within these blocks
set seed 12345
block_ra zblockV2, block_var(blockV2)
label val zblockV2 tc
table blockV2 zblockV2
capture drop __00* // Clean up

(-7.32,-2.6]   (-2.6,2.1]   (2.1,6.82] 
          11           68           21 
$`(-7.32,-2.6]`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -7.31   -5.03   -3.66   -4.14   -3.01   -2.77 

$`(-2.6,2.1]`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.3916 -0.7630 -0.0194 -0.0218  0.7592  2.0864 

$`(2.1,6.82]`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.13    2.46    2.95    3.24    3.68    6.80 


0.(-7.32,-2.6] 1.(-7.32,-2.6]   0.(-2.6,2.1]   1.(-2.6,2.1]   0.(2.1,6.82]   1.(2.1,6.82] 
             7              4             38             30              5             16 
                ZblockV2
blockV2           C  T
  0.(-7.32,-2.6]  4  3
  1.(-7.32,-2.6]  2  2
  0.(-2.6,2.1]   19 19
  1.(-2.6,2.1]   15 15
  0.(2.1,6.82]    2  3
  1.(2.1,6.82]    8  8

4.4.3 Blocking using many covariates

If instead we have many background variables, we can increase precision by thinking about blocking as a problem of “matching,” or creating sets of units which are as similar as possible across the entire set of covariates (Moore 2012; Moore and Schnakenberg 2016). Here we show two approaches using R.

Creating pairs:

## Using the blockTools package
library(blockTools)
mvblocks <- block(dat1, id.vars="id", block.vars=c("cov1","cov2"), algorithm="optimal")
dat1$blocksV3 <- createBlockIDs(mvblocks, data=dat1, id.var = "id")
dat1$ZblockV3 <- labelTC(block_ra(blocks = dat1$blocksV3))

## Just show the first ten pairs
with(dat1,table(blocksV3,ZblockV3,exclude=c()))[1:10,]
        ZblockV3
blocksV3 C T
      1  1 1
      2  1 1
      3  1 1
      4  1 1
      5  1 1
      6  1 1
      7  1 1
      8  1 1
      9  1 1
      10 1 1

Creating larger blocks:

## Using the quickblock package (with the distances package for the distance matrix)
library(distances)
library(quickblock)
distmat <- distances(dat1, dist_variables = c("cov1bin", "cov2"), id_variable = "id", normalize="mahalanobize")
distmat[1:5,1:5]
       1      2      3       4       5
1 0.0000 1.0697 0.3453 0.14026 0.14827
2 1.0697 0.0000 1.4150 0.92947 0.92146
3 0.3453 1.4150 0.0000 0.48555 0.49356
4 0.1403 0.9295 0.4856 0.00000 0.00801
5 0.1483 0.9215 0.4936 0.00801 0.00000
quantile(as.vector(distmat), seq(0,1,.1))
      0%      10%      20%      30%      40%      50%      60%      70%      80%      90%     100% 
-3.13279 -0.60902 -0.22924 -0.02745  0.09799  0.28989  0.68065  1.23306  1.76838  2.00726  2.91687 
## The caliper argument helps prevent ill-matched points
mvbigblock <- quickblock(distmat, size_constraint = 6L, caliper = 2.5)

## Look for missing points
table(mvbigblock,exclude=c()) # One point dropped due to caliper
mvbigblock
   0    1    2    3    4    5    6    7    8    9   10   11   12   13 <NA> 
   7    6    6    6    6    6    7    7    8    8    9    7    6   10    1 
dat1$blocksV4 <- mvbigblock
dat1$notblocked <- is.na(dat1$blocksV4) 
dat1$ZblockV4[dat1$notblocked==F] <- labelTC(block_ra(blocks = dat1$blocksV4))
with(dat1, table(blocksV4, ZblockV4, exclude=c()))[1:10,]
        ZblockV4
blocksV4 C T <NA>
       0 4 3    0
       1 3 3    0
       2 3 3    0
       3 3 3    0
       4 3 3    0
       5 3 3    0
       6 4 3    0
       7 4 3    0
       8 4 4    0
       9 4 4    0

It’s worth pausing to examine the differences within blocks. We’ll focus on the proportion of people in category “1” on the binary covariate (notice that the blocks are homogeneous on this covariate), as well as the difference between the largest and smallest value of the continuous covariate. This table also illustrates that, due to our use of a caliper when calling quickblock above, one observation was not included in treatment assignment.

## Summarize the blocks using the dplyr package
library(dplyr)

blockingDescEval <- dat1 %>% 
  group_by(blocksV4) %>%
  summarize(
    cov2diff = max(abs(cov2)) - min(abs(cov2)),
    cov1 = mean(cov1bin),
    count_in_block = n()
    )

blockingDescEval
# A tibble: 15 × 4
   blocksV4   cov2diff  cov1 count_in_block
   <qb_blckn>    <dbl> <dbl>          <int>
 1  0            4.54      0              7
 2  1            0.417     0              6
 3  2            3.13      1              6
 4  3            0.251     0              6
 5  4            1.16      1              6
 6  5            0.986     0              6
 7  6            0.537     1              7
 8  7            3.63      0              7
 9  8            0.980     0              8
10  9            0.966     1              8
11 10            0.691     1              9
12 11            0.714     1              7
13 12            0.834     1              6
14 13            0.671     0             10
15 NA            0         1              1

4.4.4 Disadvantages

Block-randomized assignment and analysis can help reduce both estimation error and non-treatment-related variability in our data. It may also be useful for ensuring equal distribution of treatment arms within less common subgroups in our sample, which can be especially important for preserving our statistical power when estimating heterogeneous treatment effects.

However, there are some practical disadvantages to consider. Block-randomized assignment can make treatment administration more complicated, both in terms of implementation by our partners and determining how we should incorporate blocking into our estimation strategy. Especially when a project needs to be rolled out on a short timeline, developing a complex blocking scheme might not be realistic.

With this concern in mind, remember that the benefits of blocking are not always worth the extra effort it entails. Adjusting for prognostic (i.e., correlated with Y) covariates during the analysis stage may provide sufficient improvements in precision, especially in cases where we expect a design to already be reasonably powered under simple or complete random assignment and where we do not expect subgroup analyses (particularly for rare subgroups) to be a key component of an evaluation. On the other hand, in smaller samples without block-randomized assignment, there is a greater risk of increasing variance when adjusting for non-prognostic covariates (Miratrix, Sekhon, and Yu 2013). If we think we’ll need to use covariates to improve power in a small sample, blocking might be more important.

4.5 Cluster random assignment

We often implement a new policy intervention at the level of some group of people — like a doctor’s practice, or a building, or some other administrative unit. Even though we have 100 units in our example data, imagine now that they are grouped into 10 buildings, and the policy intervention is at the building level. Below, we assign 50 of those units to treatment and 50 to control. Everyone in each building has the same treatment assignment.


## Make an indicator for cluster membership
ndat1 <- nrow(dat1)
dat1$buildingID <- rep(1:(ndat1/10), length=ndat1)
set.seed(12345)
dat1$Zcluster <- labelTC(cluster_ra(cluster=dat1$buildingID))
with(dat1, table(Zcluster, buildingID))


** Make an indicator for cluster membership
qui count
global ndat1 = r(N)
egen buildingID = seq(), from(1) to(10) block(1)
set seed 12345
cluster_ra zcluster, cluster_var(buildingID)
table zcluster buildingID
        buildingID
Zcluster  1  2  3  4  5  6  7  8  9 10
       C 10  0 10 10  0  0  0  0 10 10
       T  0 10  0  0 10 10 10 10  0  0

Cluster randomized designs raise new questions about estimation, testing, and statistical power. We describe our approaches to estimation and power analysis of cluster randomized designs in the chapter on analysis decisions.

4.6 Other randomized designs

In the past, our team has also employed stepped-wedge style designs, saturation designs aimed at discovering whether the effects of the experimental intervention are communicated across people (via some spillover or network mechanism), and designs where we try to isolate certain experimental units (like buildings) from each other so that we can focus our learning about the effects of the intervention in isolation (rather than the effects when people can communicate with each other about the intervention). There are a variety of more specialized randomization designs that may be appropriate for particular projects, and the options discussed above should not be treated as exhaustive. We may expand on some of these other randomization options here in the future.

4.7 As-if random assignment

In some circumstances, we might judge that randomly assigning a treatment of interest would be logistically infeasible. There are a variety of methods we have applied in such cases in the past to ensure assignment to treatment is at least idiosyncratic or arbitrary. What is key here is not really that assignment is random, per se. Instead, assignment must be conditionally independent of a unit’s potential outcomes (Holland 1986). Random assignment is the best way of guaranteeing this. But sometimes there are available methods of assigning treatment non-randomly that we decide are likely to satisfy this condition. Of course, relying on any as-if random assignment procedure makes it especially important to review evidence of appropriate treatment administration afterwards (see the next section).

We list examples of a few as-if random assignment procedures we have employed below, with each linking to an Analysis Plan pre-registered on the OES website for more context. But note that an as-if random procedure that is appropriate in one study may not be in another. This needs to be determined on a case-by-case basis. When possible, it may be ideal to layer two arbitrary assignment procedures on top of each other rather than rely on the plausibility of only one.

4.8 Assessing randomization (balance testing)

If we have covariates, we can evaluate the implementation of random assignment by exploring covariate differences across treatment arms.13 The ultimate goal is to ensure that our data seem consistent with our intended randomization strategy (Imai, King, and Stuart 2008). Since we want to draw conclusions about the realized sample here and not some broader population, it’s important to think carefully about what we hope to learn from statistical significance tests when evaluating balance. With that issue in mind, and though this varies across projects, we often find it valuable to evaluate balance using what we call “omnibus tests” relying on randomization inference (see chapter 3)—even if the evaluation itself will not ultimately rely on randomization inference. We lay out our reasoning in the next few subsections.

4.8.1 Separate tests for each covariate

To see some problems that traditional significance testing raises when evaluating balance, consider the common practice of performing separate difference-in-means tests for each covariate. This is reasonable in some cases, but it could also raise “multiple testing” concerns (see Chapter 5 for more on this). Briefly, it would be easy to discover one or a few covariates with noticeable mean differences due to random chance rather than real implementation problems. That is, in a well-operating experiment relying on standard inference procedures, we would expect roughly 5 in 100 individual covariate tests to show a statistically significant baseline imbalance at the .05 level by chance alone.
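
To see the scale of the problem, the sketch below simulates a correctly implemented assignment repeatedly and tests 10 independent covariates each time; the simulated covariate values and settings are ours for illustration, and complete_ra() comes from the randomizr package loaded earlier.

## Sketch: how often does at least one of 10 unrelated covariates look
## "imbalanced" at the .05 level under a correctly implemented assignment?
set.seed(12345)
flagged <- replicate(2000, {
  z <- complete_ra(100)                        # well-implemented assignment
  covs <- matrix(rnorm(100 * 10), ncol = 10)   # 10 covariates unrelated to z
  pvals <- apply(covs, 2, function(x) t.test(x[z == 1], x[z == 0])$p.value)
  any(pvals < 0.05)
})
mean(flagged)  # roughly 1 - 0.95^10 = 0.40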

The relationship between sample size and statistical significance is also important to remember. If a sample is too small, large imbalances (indicating real problems with treatment administration) may still be statistically insignificant. In other words, the balance test is too underpowered to be informative. And on the other hand, when working with large samples, negligible mean differences may still be significant, leading to spurious conclusions of imbalance. (This issue is relevant, to some degree, for any randomization assessment relying on significance testing, not just separate difference-in-means tests.)

When comparing individual covariates across treatment arms is necessary, one strategy for dealing with these issues is to rely on equivalence tests, such as the “Two One-Sided Tests” (TOST) procedure (Hartman and Hidalgo 2018; Rainey 2014).14 Another option is to evaluate statistics that are less sensitive to sample size, like standardized mean differences (e.g., Cohen’s d) and variance ratios. Here, it may be useful for transparency to pre-register what would represent evidence of meaningful treatment administration problems.15 A common heuristic, borrowing from the matching literature, is that values of Cohen’s d greater than around 0.2 or 0.25, or variance ratios greater than 2 or less than 0.5, are more likely to represent a meaningful imbalance (Stuart 2010). But these are not hard and fast rules.
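
As an illustration of the TOST idea, the sketch below performs a manual equivalence check for cov1 under the two-arm assignment from earlier; the 0.2 SD equivalence range is an assumption chosen only for this example.

## Sketch: a manual TOST-style equivalence check for cov1 under Z2armEqual.
## The 0.2 SD equivalence range below is an illustrative assumption.
delta <- 0.2 * sd(dat1$cov1)
x_t <- dat1$cov1[dat1$Z2armEqual == "T"]
x_c <- dat1$cov1[dat1$Z2armEqual == "C"]

## Two one-sided tests: the difference is above -delta AND below +delta
p_lower <- t.test(x_t, x_c, mu = -delta, alternative = "greater")$p.value
p_upper <- t.test(x_t, x_c, mu = delta, alternative = "less")$p.value

## Equivalence is supported when the larger of the two p-values is small
max(p_lower, p_upper)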

4.8.2 Omnibus tests

Instead of making separate comparisons for each covariate, we can judge whether a sample appears sufficiently incompatible with the joint (or “omnibus”) null hypothesis of no average covariate differences between treatment arms. This helps sidestep the multiple testing concern noted above. This kind of test is often implemented in practice by regressing a treatment indicator on a set of covariates via OLS and then calculating a p-value based on the model’s overall F-statistic. Failing to find a significant difference in an omnibus F-test does not “prove” that there are no imbalances. But finding a statistically significant difference tells us that a closer look at covariate balance is needed.
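
A minimal sketch of this overall-F version for the two-arm design above follows (the Wald-test variants we typically use appear in the coded examples later in this section).

## Sketch: regress a treatment indicator on covariates and read off the
## p-value for the model's overall F-statistic.
bal_fit <- lm(I(Z2armEqual == "T") ~ cov1 + cov2, data = dat1)
fstat <- summary(bal_fit)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)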

An omnibus test can also be performed using a Wald test. The idea is to compare an unrestricted model regressing treatment on the set of covariates with a restricted model including only an intercept. The Wald test then provides evidence on the plausibility of the null hypothesis that the unrestricted model does not fit the data appreciably better. This method is useful, for instance, when estimating balance while adjusting for block fixed effects. Here, both models would include block fixed effects, and the restricted model would simply drop the covariates (but is no longer intercept-only). A Wald test more easily accommodates standard error modifications (e.g., HC2) than similar alternatives like a likelihood ratio (LR) test.

Omnibus balance tests are often conducted in the common sampling-based inference framework. But design-based inference, and randomization inference in particular, is often more appropriate for balance testing. Failing to reject the joint null of no imbalance when using randomization inference implies that observed imbalances are not sufficiently distinguishable from the imbalance estimates we’d typically expect in a well-run experiment. This is exactly what we hope to learn from balance checks! Moreover, omnibus tests based on randomly permuting treatment may better control the Type I error rate (limiting spurious conclusions that a sample is imbalanced) in small-to-moderate samples (Kerwin, Rostom, and Sterck 2024; Hansen and Bowers 2008). See Hansen and Bowers (2008) for more discussion and an introduction to an alternative omnibus balance statistic (d²).

4.8.3 Summary

We typically prefer to perform some form of omnibus balance test (ideally using randomization inference), and to write out a brief plan for exploring individual imbalances further if this test rejects the null hypothesis. Time-permitting, more extensive balance checks are always valuable.

It can be important for balance testing procedures to account for any clustering or blocking employed when assigning treatment (Hansen and Bowers 2008). For blocking, this might imply calculating balance within blocks. For clustering, this may imply exploring whether a comparison of cluster-level aggregates (rather than a comparison of individual units) provides evidence of treatment administration problems. Or it might imply just accounting for within-cluster dependence when performing significance tests.

Modifications like this are straightforward when estimating an omnibus balance statistic using a regression model (examples of each are below). But they can also be incorporated into covariate-by-covariate statistics. For instance, in a blocked design, it may be reasonable to calculate separate standardized mean differences within each block (dividing within-block differences by the overall sample standard deviation) and then take a weighted average across blocks (where block-level estimates are weighted by their share of the overall sample size).

4.8.4 Coded examples

We provide R and Stata examples of several procedures discussed above. In all cases, we evaluate balance for two covariates—cov1 and cov2—under one of the randomized designs discussed earlier in this chapter: complete randomization (Z2armEqual), clustered randomization (Zcluster), or blocked randomization (ZblockV2). First, we illustrate methods for calculating standardized mean differences (SMDs; mean difference divided by pooled SD) and variance ratios (VRs) for each covariate under different randomization designs. Although there are canned packages in R (e.g., cobalt) and Stata (e.g., tebalance) that can perform some of these calculations (and in practice it’s worth exploring these options first), we’ve written out some template code for reference that performs them all manually.


## List covariates to evaluate
covlist <- c("cov1", "cov2")

## Define convenience function to perform calculations.
## May be applied to the whole dataset, cluster-level aggregates,
## or individual blocks. Assumes only 2 arms. See use below.
bstats_2arm <- function(
  d = dat1, # data to calculate balance within
  n = NULL, # total sample size; only use when d is a subset (blocking)
  tr = "Z2armEqual", # treatment var name
  y = "cov2", # evaluate balance for what covariate?
  psd = sd(dat1$cov2) # what sd to use in SMDs?
  ) {
  
  # Check whether a total sample size was specified.
  # Default is to impute this from d (assume d is not a subset).
  if (is.null(n)) {
    n <- nrow(d)
  }
  
  # Get levels of treatment var, always decreasing order
  lvl <- unique(d[,tr])
  lvl <- sort(lvl, decreasing = T)
  
  # Difference in means
  dm <- mean(d[d[,tr]==lvl[1],y], na.rm = T) - 
    mean(d[d[,tr]==lvl[2],y], na.rm = T)
  
  # Standardize using provided SD
  smd <- dm/psd
  
  # Variance ratio
  vr <- var(d[d[,tr]==lvl[1],y], na.rm = T) / 
    var(d[d[,tr]==lvl[2],y], na.rm = T)
  
  # Proportion of total sample in subset
  # (1 if n was not set)
  prop <- nrow(d)/n
  
  # Prepare output
  out <- list(
    diff_means = dm,
    smd = smd,
    vr = vr,
    prop = prop
    )
  
  return(out)
  
}

## Iterate through this covariate list, building a table
btab <- data.frame(var = covlist)
for (cov in covlist) {

  # approach 1: overall mean differences and SMDs
  sel <- btab$var == cov
  pooled <- sd(dat1[,cov])
  overall <- bstats_2arm(y = cov, psd = pooled)
  btab$diff[sel] <- overall$diff_means
  btab$smd[sel] <- overall$smd
  btab$vratio[sel] <- overall$vr 
  
  # approach 2: cluster-level mean differences and SMD.
  # approach 1 might be applied for clustered designs as well.
  by_groups <- list(dat1$buildingID, dat1$Zcluster)
  clust <- aggregate(dat1[,cov], by = by_groups, mean)
  clust <- bstats_2arm(d = clust, tr = "Group.2", y = "x", psd = pooled)
  btab$cluster_diff[sel] <- clust$diff_means
  btab$cluster_smd[sel] <- clust$smd
  btab$cluster_vratio[sel] <- clust$vr
  
  # approach 3: weighted avg of in-block mean differences and SMDs.
  blocked <- by(
    data = dat1, 
    INDICES = dat1$blockV2,
    FUN = function(x) {
      bstats_2arm(x, nrow(dat1), "ZblockV2", cov, pooled)
      },
    simplify = F
    )
  smd_block <- sapply(blocked, function(x) x$smd)
  dm_block <- sapply(blocked, function(x) x$diff_means)
  vr_block <- sapply(blocked, function(x) x$vr)
  prop_block <- sapply(blocked, function(x) x$prop)
  btab$block_diff[sel] <- weighted.mean(dm_block,prop_block)
  btab$block_smd[sel] <- weighted.mean(smd_block,prop_block)
  btab$block_vratio[sel] <- weighted.mean(vr_block,prop_block)
  
}

## View output
btab[, c(1, 3:4, 6:7, 9:10)]


** List covariates to evaluate
global covlist cov1 cov2

** Define convenience program to perform calculations.
** May be applied to the whole dataset, cluster-level aggregates,
** or individual blocks. Assumes only 2 arms. See use below.
capture program drop bstats_2arm
program define bstats_2arm, rclass sortpreserve byable(onecall)

    * varname = evaluate balance for what covariate?
    * tr = treatment var name
    * n = total sample size; only use when applied to a subset (Blocks)
    syntax varname [if] [in], tr(varname) ///
        [ n(string) psd(string) ttestopt(string) ]
    
    * Get list of by vars, passed from "by varlist:" or "bysort varlist:"
    local by "`_byvars'"
    
    * Identify the correct sample to use (if/in)
    marksample touse
    
    * Further by var setup.
    * Confirm something was passed.
    capture confirm variable `by'
    
    * If not, treat all obs as in same group.
    if _rc != 0 {
        tempvar group
        qui gen `group' = 1 if `touse'  
    }
    
    * If something was, set this up as a grouping var.
    else {      
        tempvar group
        qui egen `group' = group(`by') if `touse'       
    }
    
    * In either case, get the levels as a macro.
    qui levelsof `group' if `touse', local(by_levels)
    local len : word count `by_levels'
    
    * Make rownames, used below
    local rownames
    foreach l of local by_levels {
        local rownames `rownames' "`l'"
    }
    
    * Get n as sample size in memory, if not specified
    if "`n'"=="" {
        qui count if `touse'
        local n = r(N)
    }
    
    * Get psd as pooled sd of varname, if not specified.
    * Though varname indicated above, still mapped to `varlist'.
    if "`psd'"=="" {
        qui sum `varlist' if `touse', d
        local psd = r(sd)
    }

    * Loop through those levels, calculating
    * desired stats and saving in a matrix.
    matrix stats = J(`len', 5, .)
    matrix rownames stats = `rownames'
    matrix colnames stats = "dm" "smd" "vr" "prop" "group_n"
    local i = 0
    foreach l of local by_levels {
        
        * Update index
        local ++i
        
        * Difference in means (treatment level 2 is greater value)
        qui ttest `varlist' if `touse' & `group' == `l', by(`tr') `ttestopt'
        local dm = r(mu_2) - r(mu_1)
        matrix stats[`i', 1] = `dm'
        
        * Standardize using provided SD
        local smd = `dm'/`psd'
        matrix stats[`i', 2] = `smd'
        
        * Variance ratio
        qui ttest `varlist' if `touse' & `group' == `l', by(`tr') `ttestopt'
        local vr = (r(sd_2)^2)/(r(sd_1)^2)
        matrix stats[`i', 3] = `vr'
        
        * Proportion of total sample in subset
        * (1 if n was not set)
        qui count if `touse' & `group' == `l'
        local prop = r(N)/`n'
        local group_n = r(N)
        matrix stats[`i', 4] = `prop'
        matrix stats[`i', 5] = `group_n'
            
    }
    
    * Prepare output
    return matrix stats = stats

end

** Iterate through this covariate list, building a table
local clen : word count $covlist
matrix btab = J(`clen', 6, .)
local rownames
foreach g of global covlist {
    local rownames `rownames' "`g'"
}
matrix rownames btab = `rownames'
matrix colnames btab = "smd" "vratio" "cluster_smd" "cluster_vratio" "block_smd" "block_vratio"
local k = 0
foreach var of varlist $covlist {
    
    * approach 1: overall mean differences and SMDs
    local ++k
    qui sum `var', d
    local pooled = r(sd)
    bstats_2arm `var', tr(z2armequal) psd(`pooled')
    matrix btab[`k', 1] = r(stats)[1,"smd"]
    matrix btab[`k', 2] = r(stats)[1,"vr"]
    
    * approach 2: cluster-level mean differences and SMD.
    * approach 1 might be applied for clustered designs as well.
    preserve
        qui collapse (mean) cov1 cov2 (first) zcluster, by(buildingid)
        bstats_2arm `var', tr(zcluster) psd(`pooled') ttestopt("reverse")
    restore
    matrix btab[`k', 3] = r(stats)[1,"smd"]
    matrix btab[`k', 4] = r(stats)[1,"vr"]
    
    * approach 3: weighted avg of in-block mean differences and SMDs.
    preserve
        bysort blockv2: bstats_2arm `var', tr(zblockv2) psd(`pooled') n(100)
        qui clear
        qui svmat r(stats), names(col)
        qui sum smd [iw=prop]
        local blocked_smd = r(mean)
        qui sum vr [iw=prop]
        local blocked_vr = r(mean)
    restore
    matrix btab[`k', 5] = `blocked_smd'
    matrix btab[`k', 6] = `blocked_vr'
    
}

** View output
matrix list btab
   var      smd vratio cluster_smd cluster_vratio block_smd block_vratio
1 cov1 0.085873 0.6287     0.12226         0.6578  -0.08384        1.521
2 cov2 0.007977 0.7846     0.02615         2.8861   0.13802        2.026

Applying common heuristics that SMDs should be less than about 0.25 while VRs should be between about 0.5 and 2 (Stuart 2010), the R output for the completely randomized design doesn’t show sufficient evidence of imbalance, though the variance ratios come close (and may therefore be worth a closer look). Next, in the clustered design, for illustration, we assess balance by comparing cluster-level aggregates (whether this is the right approach in a real study with clustered assignment depends on context). Here, the variance ratios are more obviously concerning. This makes sense given that treatment was assigned at the level of only 10 clusters. Lastly, for the blocked design (where balance is assessed within each block and then aggregated across blocks), recall that blocks were formed by first dividing cov2 into three groups and then determining whether cov1 was above its median. Both groupings are relatively coarse, leading blocked randomization to also perform more poorly in terms of variance ratios.

Again, whether it is sufficient to evaluate balance in a blocked or clustered design as if complete random assignment were used, or whether it is important to further adapt the balance test to the randomization design, depends on the study. But ignoring the design when evaluating overall balance in these settings can yield misleading conclusions (Hansen and Bowers 2008).

Next, we illustrate methods for performing an omnibus Wald test under each randomization design, with examples of inference based on either standard asymptotic approximations or based on permutations of treatment assignment (i.e., randomization inference). Again, while there are canned packages to perform tasks like simulating the randomization distribution (see Chapter 5), we perform this manually for illustration, since OES projects often encounter cases where canned packages don’t quite do what we need.

Eagle-eyed readers might notice that, in Chapter 3, we calculated two-sided randomization inference p-values by comparing the absolute value of our test statistic (in that case an average treatment effect) with the absolute value of its simulated distribution under permutations of treatment. Here, because the F-statistic from a Wald test cannot be negative, we calculate a one-sided randomization inference p-value to more closely represent how the analytic (non-randomization-inference) Wald-test p-values are calculated. To do this, we just drop the step where we first take absolute values of all the statistics. This doesn’t change the results, since there are no negative values anyway. But with a test statistic that can be negative, like an average treatment effect, this is how you might perform a one-sided randomization inference test.


## Uses waldtest() from the lmtest package and vcovHC()/vcovCL() from sandwich
library(lmtest)
library(sandwich)

## Prepare data for this exercise
dat1_ <- dat1
dat1_$Z2armEqual <- ifelse(dat1_$Z2armEqual == "T", 1, 0)
dat1_$Zcluster <- ifelse(dat1_$Zcluster == "T", 1, 0)
dat1_$ZblockV2 <- ifelse(dat1_$ZblockV2 == "T", 1, 0)
dat1_$blockV2 <- as.factor(dat1_$blockV2)

## Complete RA omnibus balance check
bmod0 <- lm(Z2armEqual ~ cov1 + cov2, data=dat1_)
bmod1 <- lm(Z2armEqual ~ 1, data=dat1_)
wald <- waldtest(bmod0, bmod1, vcov = vcovHC(bmod0, type = "HC2"))

## Repeat using randomization inference
ri_dist <- sapply(
  1:1000,
  function(.x) {
    dat1_$Z2armEqual_ri <- complete_ra(nrow(dat1_))
    unrestr <- lm(Z2armEqual_ri ~ cov1 + cov2, data=dat1_)
    restr <- lm(Z2armEqual_ri ~ 1, data=dat1_)
    waldtest(unrestr, restr, vcov = vcovHC(unrestr, type = "HC2"))$F[2]
    }
  )
wald$ri_p <- c(NA, mean(wald$F[2] <= ri_dist))

## Cluster RA omnibus balance check (error adjust only)
bmod0b <- lm(Zcluster ~ cov1 + cov2, data=dat1_)
bmod1b <- lm(Zcluster ~ 1, data=dat1_)
waldb <- waldtest(bmod0b, bmod1b, vcov = vcovCL(bmod0b, cluster = dat1_$buildingID, type = "HC2"))

## Repeat using randomization inference
ri_dist <- sapply(
  1:1000,
  function(.x) {
    dat1_$Zcluster_ri <- cluster_ra(cluster=dat1_$buildingID)
    unrestr <- lm(Zcluster_ri ~ cov1 + cov2, data=dat1_)
    restr <- lm(Zcluster_ri ~ 1, data=dat1_)
    waldtest(unrestr, restr, vcov = vcovCL(unrestr, cluster = dat1_$buildingID, type = "HC2"))$F[2]
    }
  )
waldb$ri_p <- c(NA, mean(waldb$F[2] <= ri_dist))

## Cluster RA balance check: Wald test
by_groups <- list(dat1_$buildingID, dat1_$Zcluster)
clust <- aggregate(dat1_[,covlist], by = by_groups, mean)
bmod2 <- lm(Group.2 ~ cov1 + cov2, data=clust)
bmod3 <- lm(Group.2 ~ 1, data=clust)
wald_cl <- waldtest(bmod2, bmod3, vcov = vcovHC(bmod2, type = "HC2"))

## Repeat using randomization inference
ri_dist <- sapply(
  1:1000,
  function(.x) {
    dat1_$Zcluster_ri <- cluster_ra(cluster=dat1_$buildingID)
    by_groups <- list(dat1_$buildingID, dat1_$Zcluster_ri)
    clust_ri <- aggregate(dat1_[,covlist], by = by_groups, mean)
    unrestr <- lm(Group.2 ~ cov1 + cov2, data=clust_ri)
    restr <- lm(Group.2 ~ 1, data=clust_ri)
    waldtest(unrestr, restr, vcov = vcovHC(unrestr, type = "HC2"))$F[2]
    }
  )
wald_cl$ri_p <- c(NA, mean(wald_cl$F[2] <= ri_dist))

## Blocked RA omnibus balance check
bmod4 <- lm(ZblockV2 ~ cov1 + cov2 + blockV2, data=dat1_)
bmod5 <- lm(ZblockV2 ~ blockV2, data=dat1_)
wald_bl <- waldtest(bmod4, bmod5, vcov = vcovHC(bmod4, type = "HC2"))

## Repeat using randomization inference
ri_dist <- sapply(
  1:1000,
  function(.x) {
    dat1_$ZblockV2_ri <- block_ra(blocks = dat1$blockV2)
    unrestr <- lm(ZblockV2_ri ~ cov1 + cov2 + blockV2, data=dat1_)
    restr <- lm(ZblockV2_ri ~ blockV2, data=dat1_)
    waldtest(unrestr, restr, vcov = vcovHC(unrestr, type = "HC2"))$F[2]
    }
  )
wald_bl$ri_p <- c(NA, mean(wald_bl$F[2] <= ri_dist))

## Organize
omnibus <- data.frame(
  design = c("complete RA", "clustered RA (SE only)", "clustered RA", "blocked RA"),
  wald_p = c(wald$`Pr(>F)`[2], waldb$`Pr(>F)`[2], wald_cl$`Pr(>F)`[2], wald_bl$`Pr(>F)`[2]),
  ri_p = c(wald$ri_p[2], waldb$ri_p[2], wald_cl$ri_p[2], wald_bl$ri_p[2])
  )
omnibus$wald_p <- round(omnibus$wald_p, 3)
omnibus$ri_p <- round(omnibus$ri_p, 4)

## Review
omnibus


** Prepare data for this exercise, save prior data
tempfile restore
save `restore', replace
local treatlist z2armequal zcluster zblockv2
foreach l of local treatlist {
    replace `l' = cond(`l' == "T", "1", "0")
    destring `l', replace
}

** Complete RA omnibus balance check
qui reg z2armequal cov1 cov2, vce(hc2)
test cov1=cov2=0
local waldp = r(p)
local waldf = r(F)

** Repeat using randomization inference
capture program drop ri_draw
program define ri_draw, rclass
    capture drop riZ
    complete_ra riZ
    qui reg riZ cov1 cov2, vce(hc2)
    qui test cov1=cov2=0
    return scalar ri_F = r(F)
end
preserve
    simulate ///
    ri_F = r(ri_F), ///
    reps(1000) nodots: ///
    ri_draw
    gen equal_or_greater = abs(ri_F) >= abs(`waldf')
    qui sum equal_or_greater, meanonly
    local waldp_ri = r(mean)
restore

** Cluster RA omnibus balance check (error adjust only)
* (Note: this is CR1, not CR2)
qui reg zcluster cov1 cov2, cluster(buildingid)
qui test cov1=cov2=0
local waldb_p = r(p)
local waldb_f = r(F)

** Repeat using randomization inference
capture program drop ri_draw
program define ri_draw, rclass
    capture drop riZ
    qui sum zcluster
    local zsum = r(sum)
    cluster_ra riZ, cluster_var(buildingid)
    qui reg riZ cov1 cov2, cluster(buildingid)
    qui test cov1=cov2=0
    return scalar ri_F = r(F)
end
preserve
    simulate ///
    ri_F = r(ri_F), ///
    reps(1000) nodots: ///
    ri_draw
    gen equal_or_greater = abs(ri_F) >= abs(`waldb_f')
    qui sum equal_or_greater, meanonly
    local waldb_p_ri = r(mean)
restore

** Cluster RA balance check: Wald test
preserve
    collapse (mean) cov1 cov2 (first) zcluster, by(buildingid)
    qui reg zcluster cov1 cov2, vce(hc2)
    qui test cov1=cov2=0
restore
local wald_clp = r(p)
local wald_clf = r(F)

** Repeat using randomization inference
capture program drop ri_draw
program define ri_draw, rclass
    capture drop riZ
    qui sum zcluster
    local zsum = r(sum)
    cluster_ra riZ, cluster_var(buildingid)
    preserve
        collapse (mean) cov1 cov2 (first) riZ, by(buildingid)
        qui reg riZ cov1 cov2, vce(hc2)
        qui test cov1=cov2=0
    restore
    return scalar ri_F = r(F)
end
preserve
    simulate ///
    ri_F = r(ri_F), ///
    reps(1000) nodots: ///
    ri_draw
    gen equal_or_greater = abs(ri_F) >= abs(`wald_clf')
    qui sum equal_or_greater, meanonly
    local wald_clp_ri = r(mean)
restore

** Blocked RA omnibus balance check
encode blockv2, gen(fblockv2)
qui reg zblockv2 cov1 cov2 i.fblockv2, vce(hc2)
test cov1=cov2=0
local wald_blp = r(p)
local wald_blf = r(F)

** Repeat using randomization inference
capture program drop ri_draw
program define ri_draw, rclass
    capture drop riZ
    block_ra riZ, block_var(blockv2)
    qui reg riZ cov1 cov2 i.fblockv2, vce(hc2)
    qui test cov1=cov2=0
    return scalar ri_F = r(F)
end
preserve
    simulate ///
    ri_F = r(ri_F), ///
    reps(1000) nodots: ///
    ri_draw
    gen equal_or_greater = abs(ri_F) >= abs(`wald_blf')
    qui sum equal_or_greater, meanonly
    local wald_blp_ri = r(mean)
restore

** Organize
matrix omnibus = J(4, 2, .)
matrix rownames omnibus = "complete RA" "clustered RA (se only)" "clustered RA" "blocked RA"
matrix colnames omnibus = "wald_p" "ri_p"
matrix omnibus[1,1] = round(`waldp', 0.001)
matrix omnibus[1,2] = round(`waldp_ri', 0.001)
matrix omnibus[2,1] = round(`waldb_p', 0.001)
matrix omnibus[2,2] = round(`waldb_p_ri', 0.001)
matrix omnibus[3,1] = round(`wald_clp', 0.001)
matrix omnibus[3,2] = round(`wald_clp_ri', 0.001)
matrix omnibus[4,1] = round(`wald_blp', 0.001)
matrix omnibus[4,2] = round(`wald_blp_ri', 0.001)

** Review
matrix list omnibus
                  design wald_p  ri_p
1            complete RA  0.907 0.886
2 clustered RA (SE only)  0.680 0.704
3           clustered RA  0.834 0.814
4             blocked RA  0.242 0.261

In these examples, both standard asymptotic and randomization-based inference yield similar conclusions: we fail to reject the joint null of no imbalance (where overall imbalance is quantified using the F-statistic from a Wald test). Given the small sample size in this example, though we do not see sufficient evidence to reject the joint null, concerns that the omnibus test could be under-powered might still support evaluating covariate-by-covariate statistics, as we do above, for supplementary evidence.

Finally, while we do not report results in text, we also illustrate how to perform the d² omnibus test mentioned above for a few different randomization designs. This is implemented in R using the function balanceTest() from the package RItools (Hansen and Bowers 2008; Bowers, Fredrickson, and Hansen 2016). In Stata, this is implemented through a command that calls the R package and applies it to the data in memory (so it still requires an R installation on the user’s computer). The function itself provides significance test results based on a large-sample approximation to the distribution of d² statistics across many randomizations, but its randomization distribution could also be simulated, as in our examples above. You can think of the d² test statistic as a summary measure of mean differences across each covariate that is also sensitive to imbalances in combinations of covariates.


## Uses balanceTest() from the RItools package
library(RItools)

## Complete RA
balanceTest(Z~cov1+cov2, data=dat1)

## Blocked RA
balanceTest(ZblockV3~cov1+cov2+strata(blocksV3), data=dat1)

## Blocked and clustered RA
balanceTest(ZblockV3~cov1+cov2+strata(blocksV3)+cluster(clusterID), data=dat1)


* ssc install xbalance.
* See "help xbalance" for additional Stata setup instructions.
* This is calling the RItools R package, so you will need R installed.
* The necessary path to Rterm.exe may look something like this:
global Rterm_path "C:\Program Files\R\R-4.2.1\bin\x64\Rterm.exe"

** Complete RA
gen single_block = 1 // To force only unstratified balance testing
label val zblockV2 // Remove value labels first
label val z
xbalance z single_block cov1 cov2

** Block RA
* zblockV2 instead of zblockV3
* zblockV3 is generated using blockTools, with no Stata equivalent
xbalance zblockV2 blockV2 cov1 cov2 

4.8.5 What to do with “failed” randomization assessments?

Observing a p-value of less than .05 in an omnibus test ought to trigger extra scrutiny about implementation and/or how the data were recorded. For example, we might respond by contacting our agency partner to learn more about how random numbers were generated or how random assignment code was used (particularly if we didn’t perform the random assignment ourselves). In many circumstances, this follow-up investigation might suggest that random assignment was implemented correctly, and that our understanding of the design or the data was simply incorrect (i.e., the balance test was not performed correctly). But sometimes, a follow-up investigation may not turn up any misunderstandings at all. In those situations, we will need to determine whether our rejection of the null hypothesis of appropriate random assignment is a false positive, or evidence of a more systematic problem.

If our rejection of the null hypothesis appears to be driven by one or more covariates that are substantively important—say, the variable age looks very imbalanced between treated and control groups in a health-related randomized trial—then we might present both the unadjusted results and a separate set of results that adjust for the covariate(s) in question (e.g., through a stratified difference-in-means estimator, or by using them as controls in a linear regression). Large differences between the adjusted and unadjusted estimates might help us interpret our findings: estimating separate effects within different age groups, for example, might tell us something useful about the particular context of a study and inform the conclusions we draw. Unfortunately, considerations like these will sometimes tell us that whatever error occurred during randomization skews our results too much to let us draw clear conclusions.
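
As a minimal sketch of this kind of sensitivity check, assume a hypothetical outcome variable y in dat1 and that cov1 was the covariate that looked imbalanced; the comparison below reports an unadjusted and a covariate-adjusted estimate side by side.

## Sketch: compare the unadjusted estimate with one adjusting for the
## covariate in question. Assumes a hypothetical outcome y; uses HC2 SEs.
library(lmtest)
library(sandwich)

unadj <- lm(y ~ Z2armEqual, data = dat1)
adj <- lm(y ~ Z2armEqual + cov1, data = dat1)

coeftest(unadj, vcov = vcovHC(unadj, type = "HC2"))
coeftest(adj, vcov = vcovHC(adj, type = "HC2"))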

References

Bowers, Jake, Mark Fredrickson, and Ben Hansen. 2016. RItools: Randomization Inference Tools.
Coppock, Alexander. 2023. Randomizr: Easy-to-Use Tools for Common Forms of Random Assignment and Sampling. https://declaredesign.org/r/randomizr/.
Hansen, Ben B., and Jake Bowers. 2008. “Covariate Balance in Simple, Stratified and Clustered Comparative Studies.” Statistical Science 23 (2): 219–36.
Hartman, Erin, and F Daniel Hidalgo. 2018. “An Equivalence Approach to Balance and Placebo Tests.” American Journal of Political Science 62 (4): 1000–1013.
Holland, Paul W. 1986. “Statistics and Causal Inference (with Discussion).” Journal of the American Statistical Association 81: 945–70.
Imai, Kosuke, Gary King, and Elizabeth A Stuart. 2008. “Misunderstandings Between Experimentalists and Observationalists about Causal Inference.” Journal of the Royal Statistical Society Series A: Statistics in Society 171 (2): 481–502.
Kerwin, Jason, Nada Rostom, and Olivier Sterck. 2024. “Striking the Right Balance: Why Standard Balance Tests over-Reject the Null, and How to Fix It.” Institute of Labor Economics (IZA).
Miratrix, Luke W, Jasjeet S Sekhon, and Bin Yu. 2013. “Adjusting Treatment Effect Estimates by Post-Stratification in Randomized Experiments.” Journal of the Royal Statistical Society Series B: Statistical Methodology 75 (2): 369–96.
Moore, Ryan T. 2012. “Multivariate Continuous Blocking to Improve Political Science Experiments.” Political Analysis 20 (4): 460–79.
Moore, Ryan T., and Keith Schnakenberg. 2016. blockTools: Blocking, Assignment, and Diagnosing Interference in Randomized Experiments. http://www.ryantmoore.org/html/software.blockTools.html.
Muralidharan, Karthik, Mauricio Romero, and Kaspar Wüthrich. 2023. “Factorial Designs, Model Selection, and (Incorrect) Inference in Randomized Experiments.” Review of Economics and Statistics, 1–44.
Rainey, Carlisle. 2014. “Arguing for a Negligible Effect.” American Journal of Political Science 58 (4): 1083–91.
Small, D. S., K. G. Volpp, and P. R. Rosenbaum. 2011. “Structured Testing of 2 × 2 Factorial Effects: An Analytic Plan Requiring Fewer Observations.” The American Statistician 65 (1): 11–15.
Stuart, Elizabeth A. 2010. “Matching Methods for Causal Inference: A Review and a Look Forward.” Statistical Science: A Review Journal of the Institute of Mathematical Statistics 25 (1): 1.

  7. We develop the recommendations themselves following processes described in more detail on our website.

  8. “Fair” in the sense that everyone has an equal chance of receiving treatment. But we could imagine thinking about fairness in other ways, too.

  9. But see Small, Volpp, and Rosenbaum (2011), for instance, for a way to increase the power of tests for such questions. This approach is not currently part of our standard practice.

  10. As the authors discuss, this could be framed as either an internal or external validity concern.

  11. Some studies instead employ “post-stratification,” where simple or complete randomization might be used to assign treatment, but the data are still analyzed as-if they were blocked. See Miratrix, Sekhon, and Yu (2013), for instance, for more discussion of this strategy. Post-stratification is more likely than blocking to increase variance (relative to a simple difference in means) when there is not actually any between-strata variation in potential outcomes. Post-stratification with too many strata may also have variance costs, which is again less likely when treatment assignment is blocked.

  12. We use “block” rather than “strata” throughout this document to refer to groups of experimental units that are fixed before random assignment occurs.

  13. If there are no available covariates, we can at least evaluate whether the distribution of treated units appears consistent with our intended strategy.

  14. This is also useful when interpreting “null effects” for a study’s primary outcome measure. An equivalence test provides a more principled way to judge whether a statistically insignificant difference is attributable to a true null (no imbalance) or a lack of power to detect meaningful imbalance.

  15. It’s sometimes helpful to go further and look at things like Kolmogorov-Smirnov statistics, Q-Q plots, or statistics for various polynomial transformations of continuous covariates.