Test, Split, Repeat — findBlocks • manytestsr

Split and test.

Usage

findBlocks(
  idat,
  bdat,
  blockid = "block",
  splitfn,
  pfn,
  alphafn = NULL,
  simthresh = 20,
  sims = 1000,
  maxtest = 2000,
  thealpha = 0.05,
  thew0 = 0.05 - 0.001,
  fmla = YContNorm ~ trtF | blockF,
  parallel = "multicore",
  ncores = 4,
  copydts = FALSE,
  splitby = "hwt",
  stop_splitby_constant = TRUE,
  blocksize = "hwt",
  trace = FALSE
)

Arguments

idat: Data at the unit level.
bdat: Data at the block level.
blockid: A character name of the column in idat and bdat indicating the block.
splitfn: A function that splits the data into two pieces, using bdat.
pfn: A function that produces p-values, using idat.
alphafn: A function to adjust alpha at each step. Takes one or more p-values plus a stratum or batch indicator. Currently alpha_investing, alpha_saffron, alpha_addis are accepted. All of them wrap the corresponding functions from the onlineFDR package.
simthresh: The number of total observations below which the p-value functions use permutations rather than asymptotic approximations.
sims: Number of permutations for permutation-based testing.
maxtest: Maximum splits or tests to do. Should probably not be smaller than the number of experimental blocks.
thealpha: The error rate for a given test when alphafn is NULL, or the starting alpha when alphafn is not NULL.
thew0: The starting "wealth" of the alpha investing procedure. Only relevant when alphafn is not NULL.
fmla: A formula with outcome~treatment assignment | block where treatment assignment and block must be factors.
parallel: Whether pfn should use parallel processing for permutation-based testing. Can be "no", "snow", or "multicore", following approximate() in the coin package. The default shown in the usage above is "multicore".
ncores: The number of cores used for parallel processing.
copydts: TRUE or FALSE. TRUE if using findBlocks standalone. FALSE if copied objects are being sent to findBlocks from other functions.
splitby: A string naming the column in bdat that guides splitting. Examples include a column of block sizes or block harmonic mean weights, a column with a covariate (or a function of covariates), or a factor whose level labels are separated by "." and encode a pre-specified series of splits (see splitSpecifiedFactor).
stop_splitby_constant: TRUE if the splitting should stop when splitby is constant within a given branch of the tree. FALSE if splitting should continue even when splitby is constant. Default is TRUE. Different combinations of splitby, splitfn, and stop_splitby_constant make more or less sense as described below.
blocksize: A string naming the column in bdat that contains information about the size of the block (or another determinant of the power of tests within that block, such as the harmonic mean weight of the block or the variance of the outcome within the block).
trace: Logical, FALSE (default) to not print split number. TRUE prints the split number.
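
As an illustration of the two inputs described above, here is a minimal sketch in base R that builds a toy idat and bdat. The column names YContNorm, trtF, blockF, block, and hwt come from the defaults in the usage. Everything else is an assumption made for illustration: the helper columns nb and pb are hypothetical, and the weight formula n_b * p_b * (1 - p_b) is one common harmonic-mean-style weight, not necessarily the one the package uses.

```r
set.seed(1)

# Toy unit-level data: four blocks of unequal size.
sizes <- c(b1 = 4, b2 = 6, b3 = 8, b4 = 10)
idat <- data.frame(
  block = rep(names(sizes), times = sizes),
  stringsAsFactors = FALSE
)

# Assign half of each block to treatment, in random order within the block.
idat$trt <- unname(unlist(lapply(sizes, function(n) {
  sample(rep(0:1, length.out = n))
})))
idat$YContNorm <- rnorm(nrow(idat))

# The default fmla expects factor versions of treatment and block.
idat$trtF <- factor(idat$trt)
idat$blockF <- factor(idat$block)

# Block-level data: one row per block.
nb <- tapply(idat$trt, idat$block, length) # units per block
pb <- tapply(idat$trt, idat$block, mean)   # proportion treated per block
bdat <- data.frame(
  block = names(nb),
  nb = as.vector(nb),
  pb = as.vector(pb),
  stringsAsFactors = FALSE
)

# ASSUMPTION: weight defined as n_b * p_b * (1 - p_b).
bdat$hwt <- bdat$nb * bdat$pb * (1 - bdat$pb)
```

Since findBlocks() returns a data.table, it may also expect data.table inputs. If so, these data frames could be converted with data.table::as.data.table() before the call.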

Value

A data.table containing information about the sequence of splitting and testing.

Details

Some notes about the splitting functions and how they relate to splitting criteria (splitby) and stopping criteria (stop_splitby_constant).

splitLOO() selects blocks one at a time in decreasing order of the splitby vector, so that each split yields two tests: one on the highest-ranked block and one on all of the remaining blocks (for example, the block with the most units versus the rest of the blocks). When the splitby vector has ties, it chooses one block at random among those tied for the largest rank. When the splitby vector has few distinct values, for example only two, it still splits by treating the vector as numeric (so 1 is ranked higher than 0) and then choosing randomly among ties. If stop_splitby_constant=TRUE, the algorithm stops after exhausting the blocks in the higher-ranked category (in the binary splitby case). For this reason we advise against using splitLOO() with a factor splitby vector that has few categories. splitLOO() is best used with a splitby vector like block size, which could be constant (producing a random choice of a single block) or could vary (focusing the testing on the largest or highest-ranked blocks).
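
The leave-one-out idea can be sketched in a few lines of base R. This is not the package's splitLOO(); loo_split_sketch() and its group labels are hypothetical names that only show the "largest block versus the rest, ties broken at random" logic for a single split.

```r
# Hypothetical helper: mark one block with the largest splitby value as
# "focal" and all other blocks as "rest", breaking ties at random.
loo_split_sketch <- function(blockid, splitby) {
  top <- which(splitby == max(splitby))
  # Index into `top` rather than calling sample(top, 1): when `top` has
  # length one, sample() would draw from 1:top instead.
  chosen <- top[sample.int(length(top), 1)]
  group <- ifelse(seq_along(blockid) == chosen, "focal", "rest")
  data.frame(blockid = blockid, splitby = splitby, group = group,
             stringsAsFactors = FALSE)
}

set.seed(42)
# Blocks b2 and b4 are tied for the largest size, so one is picked at random.
res_loo <- loo_split_sketch(c("b1", "b2", "b3", "b4"), c(4, 10, 6, 10))
print(res_loo)
```

Repeating this on the "rest" group would continue peeling off the next largest block, which is why a splitby vector with few distinct values leaves most choices to chance.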
splitEqualApprox() splits the set of blocks into two groups where the sum of the splitby vector is approximately the same in each group. For example, if splitby is the number of units in a block, this splitting function makes two groups of blocks, each with approximately the same total number of units. It also works with discrete variables or factors: it computes rank_splitby <- rank(splitby) and then divides the blocks into groups by taking every other rank. So, for ordered factor variables with few categories, it allocates every other category to one group or the other.
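
A minimal sketch of the "every other rank" rule follows. It is not the package's splitEqualApprox(); equal_split_sketch() is a hypothetical helper, and using ties.method = "first" to force unique ranks is an assumption made here so that tied blocks still alternate between groups.

```r
# Hypothetical helper: rank the blocks on splitby and send odd ranks to one
# group and even ranks to the other.
equal_split_sketch <- function(blockid, splitby) {
  # ASSUMPTION: ties.method = "first" gives unique ranks 1..n.
  rank_splitby <- rank(splitby, ties.method = "first")
  group <- ifelse(rank_splitby %% 2 == 1, "g1", "g2")
  data.frame(blockid = blockid, splitby = splitby, group = group,
             stringsAsFactors = FALSE)
}

sizes <- c(4, 6, 8, 10, 12, 14)
res_eq <- equal_split_sketch(paste0("b", 1:6), sizes)

# Total units per group: g1 gets sizes 4, 8, 12 and g2 gets 6, 10, 14.
print(tapply(res_eq$splitby, res_eq$group, sum))
```

The group totals (24 and 30 here) are close but not identical, which is the sense in which the split is approximate.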
splitCluster() splits the blocks into groups whose members are as similar as possible to each other on splitby, using the k-means clustering algorithm (via kmeans() or KMeans_rcpp()). This will not work with factor variables. When the splitting criterion is constant and stop_splitby_constant=FALSE, it returns random splits into two roughly equal-sized groups of blocks. If stop_splitby_constant=TRUE, findBlocks() stops and returns the groups of blocks as detected or not.
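
The clustering split can be illustrated with stats::kmeans() alone. This sketch is not the package's splitCluster(); cluster_split_sketch() is a hypothetical helper, and the choice of nstart = 10 and the handling of the constant case are assumptions that mirror the stop_splitby_constant=FALSE behavior described above.

```r
# Hypothetical helper: two-group k-means split on a numeric splitby vector,
# falling back to a random, roughly even split when splitby is constant.
cluster_split_sketch <- function(blockid, splitby) {
  stopifnot(is.numeric(splitby))
  if (length(unique(splitby)) < 2) {
    # Constant splitby: assign groups 1 and 2 in random order.
    group <- sample(rep(c(1L, 2L), length.out = length(splitby)))
  } else {
    km <- stats::kmeans(splitby, centers = 2, nstart = 10)
    group <- km$cluster
  }
  data.frame(blockid = blockid, splitby = splitby, group = group,
             stringsAsFactors = FALSE)
}

set.seed(7)
# Three small blocks and three large blocks should land in separate groups.
res_cl <- cluster_split_sketch(paste0("b", 1:6), c(4, 5, 6, 40, 42, 45))
print(res_cl)

# Constant splitby: the split is random but the groups are equal in size.
res_const <- cluster_split_sketch(paste0("b", 1:6), rep(10, 6))
print(table(res_const$group))
```

The cluster labels 1 and 2 are arbitrary, so only the grouping itself, not which group is called 1, carries meaning.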
splitSpecifiedFactor() splits the blocks into two groups following a prespecified pattern encoded in the labels of the factor levels. For example, if we imagine three nested levels of splitting (like states, districts, neighborhoods), the factor would have labels like category1_level1.category2_level1.category3_level1, and splits occur from left to right depending on whether there is existing variation at that level. When the factor is constant and stop_splitby_constant=TRUE, splitting stops. For this reason we recommend that the right-most label of this factor be the individual blocks themselves, to ensure that testing descends to the block level if it can. When stop_splitby_constant=FALSE, it uses random splits.
splitSpecifiedFactorMulti() splits the blocks into two or more groups following a prespecified pattern encoded in the labels of the factor levels. For example, if we imagine three nested levels of splitting (like states, districts, neighborhoods), the factor would have labels like category1_level1.category2_level1.category3_level1, and splits occur from left to right depending on whether there is existing variation at that level. For this reason we recommend that the right-most label of this factor be the individual blocks themselves, to ensure that testing descends to the block level if it can. When the factor is constant and stop_splitby_constant=TRUE, splitting stops. When stop_splitby_constant=FALSE, it uses random splits.
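
The left-to-right reading of dotted labels can be sketched as follows. This is not the package's implementation; specified_split_sketch() and the example labels are hypothetical. The sketch follows the multi-group variant (one group per category at the first level that varies), because how splitSpecifiedFactor() reduces several categories to exactly two groups is not described here.

```r
# Hypothetical helper: find the first (left-most) level of the dotted labels
# that varies across blocks and return a grouping key at that level.
# Returns NULL when the labels are constant at every level.
specified_split_sketch <- function(labels) {
  parts <- strsplit(as.character(labels), ".", fixed = TRUE)
  depth <- min(lengths(parts))
  for (lev in seq_len(depth)) {
    # Use the prefix up to this level so nested names stay distinct.
    key <- vapply(
      parts,
      function(p) paste(p[seq_len(lev)], collapse = "."),
      character(1)
    )
    if (length(unique(key)) > 1) {
      return(key)
    }
  }
  NULL
}

# All blocks share the state, so the first split happens at the district level.
labs <- c("stateA.d1.n1", "stateA.d1.n2", "stateA.d2.n3", "stateA.d2.n4")
print(specified_split_sketch(labs))

# Constant labels: nothing left to split on.
print(specified_split_sketch(rep("stateA.d1", 3)))
```

Applying the same function within each resulting group would descend to the next level, which is why making the right-most label the block itself lets testing reach individual blocks.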