Experimental evaluations of public policies often randomize a new intervention within many sites or blocks. After a report of an overall result -- statistically significant or not -- the natural question from a policy maker is: \emph{where} did any effects occur? Standard adjustments for multiple testing provide little power to answer this question. In simulations modeled after a 44-block education trial, the Hommel adjustment -- among the most powerful procedures controlling the family-wise error rate (FWER) -- detects effects in only 11\% of truly non-null blocks. We develop a procedure that tests hypotheses top-down through a tree: test the overall null at the root, then groups of blocks, then individual blocks, stopping any branch where the null is not rejected. In the same 44-block design, this approach detects effects in 44\% of non-null blocks -- roughly four times the detection rate. A stopping rule and valid tests at each node suffice for weak FWER control. We show that the strong-sense FWER depends on how rejection probabilities accumulate along paths through the tree. This yields a diagnostic: when power decays fast enough relative to branching, no adjustment is needed; otherwise, an adaptive $\alpha$-adjustment restores control. We apply the method to 25 MDRC education trials and provide an R package, \texttt{manytestsr}.
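
The top-down testing order with its stopping rule can be illustrated with a minimal sketch. It assumes each node already carries a p-value from a valid test of that node's null, and it uses a fixed level at every node rather than the adaptive adjustment mentioned above. The names `Node` and `top_down_rejections` and all p-values are hypothetical, chosen for illustration; this is not the interface of \texttt{manytestsr}.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One hypothesis in the tree: the overall null, a group of blocks, or a block."""
    name: str
    p_value: float  # assumed to come from a valid test of this node's null
    children: List["Node"] = field(default_factory=list)


def top_down_rejections(root: Node, alpha: float = 0.05) -> List[str]:
    """Test from the root downward; descend only below rejected nodes.

    Uses the same fixed alpha at every node (no adaptive adjustment).
    """
    rejected: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.p_value > alpha:
            # Stopping rule: the null is not rejected here, so nothing
            # below this node is tested.
            continue
        rejected.append(node.name)
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(node.children))
    return rejected


# Made-up p-values for a tiny two-group, three-block tree.
root = Node("overall", 0.001, [
    Node("group_A", 0.01, [
        Node("block_a1", 0.02),
        Node("block_a2", 0.30),
    ]),
    Node("group_B", 0.40, [
        Node("block_b1", 0.001),  # never tested: its parent is not rejected
    ]),
])

rejected = top_down_rejections(root)
print(rejected)
```

In this toy tree the small p-value at `block_b1` is never examined because testing stops at `group_B`. Pruning branches this way means fewer hypotheses are ever tested, at the cost of missing effects below a node that fails to reject.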