Power Under Multiplicity Project (PUMP): Estimating Power, Minimum Detectable Effect Size, and Sample Size When Adjusting for Multiple Outcomes in Multi-level Experiments

May 15, 2023


Kristen Hunter, Luke Miratrix, Kristin Porter

From arXiv, 60 pages, 6 figures

For randomized controlled trials (RCTs) in which a single intervention is measured on multiple outcomes, researchers often apply a multiple testing procedure (such as Bonferroni or Benjamini-Hochberg) to adjust $p$-values. Such an adjustment reduces the likelihood of spurious findings, but it also reduces statistical power, sometimes substantially, lowering the probability of detecting effects when they do exist. However, this consideration is frequently ignored in typical power analyses, as existing tools do not easily accommodate the use of multiple testing procedures. We introduce the PUMP R package as a tool for analysts to estimate statistical power, minimum detectable effect size, and sample size requirements for multi-level RCTs with multiple outcomes. Multiple outcomes are accounted for in two ways. First, power estimates from PUMP properly account for the adjustment in $p$-values from applying a multiple testing procedure. Second, as researchers change their focus from one outcome to multiple outcomes, different definitions of statistical power emerge. PUMP allows researchers to consider a variety of definitions of power, as some may be more appropriate for the goals of their study. The package estimates power for frequentist multi-level mixed effects models, and supports a variety of commonly used RCT designs, models, and multiple testing procedures. In addition to the main functionality of estimating power, minimum detectable effect size, and sample size requirements, the package allows the user to easily explore the sensitivity of these quantities to changes in underlying assumptions.
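The two points in the abstract (adjusted $p$-values lower power, and "power" has several meanings once there are multiple outcomes) can be illustrated with a small simulation. The sketch below is not the PUMP package and does not use its API. It is a standard-library Python illustration under simplifying assumptions that are not from the abstract: test statistics are independent standard-normal $z$-scores with a common mean shift, and tests are two-sided. The function names and the power labels (`individual`, `min1`, `complete`) are hypothetical names chosen for this example.

```python
import random
from statistics import NormalDist


def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up, monotonicity enforced)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value never exceeds the adjusted value of a larger raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def simulate_power(effect_z, n_outcomes, adjust, alpha=0.05, n_sims=2000, seed=0):
    """Monte Carlo power for several outcomes under a p-value adjustment.

    Assumption (illustrative only): each outcome's test statistic is an
    independent draw from N(effect_z, 1), tested two-sided.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    total_rejections = 0
    sims_with_any = 0
    sims_with_all = 0
    for _ in range(n_sims):
        pvals = []
        for _ in range(n_outcomes):
            z = rng.gauss(effect_z, 1.0)
            pvals.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
        rejected = [p <= alpha for p in adjust(pvals)]
        total_rejections += sum(rejected)
        sims_with_any += any(rejected)
        sims_with_all += all(rejected)
    return {
        # Average probability of rejecting a given outcome's null.
        "individual": total_rejections / (n_sims * n_outcomes),
        # Probability of rejecting at least one null.
        "min1": sims_with_any / n_sims,
        # Probability of rejecting every null.
        "complete": sims_with_all / n_sims,
    }
```

Calling `simulate_power(2.5, 3, lambda p: list(p))` and `simulate_power(2.5, 3, bonferroni)` with the same seed compares unadjusted and Bonferroni-adjusted power on identical simulated draws. Because adjusted $p$-values are never smaller than raw ones, the adjusted run can only reject a subset of what the unadjusted run rejects. A real analysis would also need to account for the multi-level design and the correlation between outcomes, which this sketch ignores.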
