Experimental evaluations of public policies often randomize a new intervention within many sites or blocks. After a report of an overall result -- statistically significant or not -- the natural question from a policy maker is: \emph{where} did any effects occur? Standard adjustments for multiple testing provide little power to answer this question. In simulations modeled after a 44-block education trial, the Hommel adjustment -- among the most powerful procedures controlling the family-wise error rate (FWER) -- detects effects in only 11\% of truly non-null blocks. We develop a procedure that tests hypotheses top-down through a tree: test the overall null at the root, then groups of blocks, then individual blocks, stopping any branch where the null is not rejected. In the same 44-block design, this approach detects effects in 44\% of non-null blocks -- roughly four times the detection rate. A stopping rule and valid tests at each node suffice for weak FWER control. We show that the strong-sense FWER depends on how rejection probabilities accumulate along paths through the tree. This yields a diagnostic: when power decays fast enough relative to branching, no adjustment is needed; otherwise, an adaptive $\alpha$-adjustment restores control. We apply the method to 25 MDRC education trials and provide an R package, \texttt{manytestsr}.
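The top-down stopping rule described above can be illustrated with a minimal Python sketch. It is not the \texttt{manytestsr} API and not the authors' implementation: the `Node` class, the function `top_down_rejections`, the toy tree, and its p-values are all hypothetical. The sketch assumes each node already carries a valid p-value and uses the same fixed level `alpha` at every node, so it omits the adaptive $\alpha$-adjustment entirely.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One hypothesis in the tree (hypothetical structure for illustration)."""
    name: str
    p_value: float  # assumed to come from a valid test at this node
    children: List["Node"] = field(default_factory=list)


def top_down_rejections(node: Node, alpha: float = 0.05) -> List[str]:
    """Test a node; descend to its children only if its null is rejected."""
    if node.p_value > alpha:
        # Stopping rule: nothing below a non-rejected node is ever tested.
        return []
    rejected = [node.name]
    for child in node.children:
        rejected.extend(top_down_rejections(child, alpha))
    return rejected


# Toy tree with made-up p-values: root -> two groups -> two blocks each.
tree = Node("overall", 0.001, [
    Node("group_A", 0.004, [Node("block_1", 0.01), Node("block_2", 0.30)]),
    Node("group_B", 0.20, [Node("block_3", 0.02), Node("block_4", 0.60)]),
])

rejected = top_down_rejections(tree)
print(rejected)  # ['overall', 'group_A', 'block_1']
```

In this toy tree, `block_3` has a small p-value but is never tested, because its parent `group_B` is not rejected. That gating is what limits the number of tests actually carried out.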