Experimental evaluations of public policy often randomize an intervention within many sites or blocks. Once an overall effect is reported, the question that matters for action is where it occurred. Standard multiple-testing corrections answer this question with little power because they ignore how the experiment is organized: blocks nest within cohorts, sites, and districts. We organize the hypotheses as a tree that follows this administrative structure and test them top-down, descending into a branch only when its parent null is rejected. We show that this stopping rule and valid node-level tests together suffice for weak control of the family-wise error rate (FWER). Whether the same procedure also controls the FWER in the strong sense depends on a single quantity computable before any data are seen: an error load that summarizes how rejection probability accumulates along paths through the tree. This diagnostic tells an analyst in advance, from design quantities alone, whether the unadjusted procedure controls the FWER or an adjustment is required. Across 25 block-randomized MDRC education trials, the diagnostic indicates that no adjustment is needed in any of them, so the two conditions alone control the FWER while each test runs at the full nominal level; the top-down procedure detects individual blocks that the Hommel correction misses and locates higher-level groups of blocks that bottom-up testing cannot evaluate. For high-error-load designs we derive an adaptive alpha-schedule, prove that it controls the FWER on regular, irregular, and pruned trees, and confirm this in simulation. The same diagnostic flags when that schedule is needed: in a design calibrated to the National Job Corps Study, a wide multisite trial of about one hundred centers, the unadjusted procedure inflates the FWER, the adaptive schedule restores control, and top-down testing still detects more affected sites than bottom-up or hierarchical corrections.
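The top-down stopping rule described above can be illustrated with a minimal sketch. It assumes each node already carries a valid p-value for its own null hypothesis and tests every node at the same unadjusted level. The `Node` class, the `top_down_test` function, and the p-values in the toy tree are hypothetical names and numbers chosen for illustration; they are not taken from the paper, and the error-load diagnostic and the adaptive alpha-schedule are not implemented here because this section does not specify them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One hypothesis in the tree (hypothetical structure for illustration)."""
    name: str
    p_value: float  # assumed to come from a valid node-level test
    children: List["Node"] = field(default_factory=list)


def top_down_test(root: Node, alpha: float = 0.05) -> List[str]:
    """Test nodes top-down; descend into a branch only if its parent is rejected.

    Every node is tested at the same unadjusted level `alpha`.
    Returns the names of rejected nodes in depth-first order.
    """
    rejected = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.p_value <= alpha:
            rejected.append(node.name)
            # Children become eligible for testing only after a rejection.
            stack.extend(reversed(node.children))
        # If the node is not rejected, its whole subtree is left untested.
    return rejected


# Toy tree with made-up p-values: overall effect -> districts -> blocks.
tree = Node("overall", 0.001, [
    Node("district_A", 0.004, [Node("block_A1", 0.01), Node("block_A2", 0.40)]),
    Node("district_B", 0.30, [Node("block_B1", 0.001)]),
])

rejected = top_down_test(tree)
print(rejected)
```

In this toy tree, `block_B1` is never tested even though its p-value is small, because its parent `district_B` is not rejected. That gating is what the stopping rule refers to.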