We argue that Bonferroni correction is a better choice for online experimentation than it is commonly given credit for. The case rests on four considerations. First, it is the simplest broadly implementable method that controls the family-wise error rate (FWER) and produces unconditional simultaneous confidence intervals for every metric. Second, in a well-specified decision framework, guardrail and quality metrics use intersection-union logic and cannot inflate the false positive rate, so the Bonferroni denominator is the number of success metrics only, not the total metric count. Third, it is uniquely tractable for pre-experiment sample size calculations. Fourth, its power cost can be put in empirical context. Drawing on a simulation study and an empirical analysis of 1,296 experiments run on Spotify's experimentation platform, Confidence, we show that the power loss relative to more sophisticated FWER methods depends on both how the correction family is specified and how many metrics are truly non-null. When guardrail metrics are incorrectly included in the family, Holm and Hommel are nearly indistinguishable from Bonferroni. When the family is correctly restricted to success metrics only, Holm and Hommel gain roughly 4--5 percentage points in ship rate (the fraction of experiments where the treatment is deployed) over Bonferroni. When few metrics are truly non-null, the gap between methods narrows to near zero.
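To make the role of the correction family concrete, the following is a minimal Python sketch, not the implementation used in the study or on the Confidence platform. The function names (`bonferroni_reject`, `holm_reject`, `bonferroni_ci`) and all p-values are hypothetical and chosen only for illustration. The interval uses a two-sided normal approximation, which is an assumption of this sketch.

```python
from statistics import NormalDist


def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m, with m the family size."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values with alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break  # stop at the first non-rejection
        reject[i] = True
    return reject


def bonferroni_ci(estimate, std_error, m, alpha=0.05):
    """Two-sided normal-approximation interval at level 1 - alpha / m."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return (estimate - z * std_error, estimate + z * std_error)


# Hypothetical p-values, invented for illustration only.
success_p = [0.012, 0.030]
guardrail_p = [0.30, 0.45, 0.60]

# Family restricted to success metrics (m = 2).
restricted_bonf = bonferroni_reject(success_p)  # [True, False]
restricted_holm = holm_reject(success_p)        # [True, True]

# Family that also counts guardrail metrics (m = 5); keep the success entries.
n_success = len(success_p)
full_bonf = bonferroni_reject(success_p + guardrail_p)[:n_success]  # [False, False]
full_holm = holm_reject(success_p + guardrail_p)[:n_success]        # [False, False]

print(restricted_bonf, restricted_holm, full_bonf, full_holm)
print(bonferroni_ci(estimate=1.0, std_error=0.4, m=n_success))
```

In this toy example Holm rejects one more success hypothesis than Bonferroni when the family holds only the two success metrics, while the two methods coincide once the three guardrail p-values are counted in the family. This is consistent in direction with the pattern described above, but it is a single constructed case and says nothing about the size of the effect.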