While there is a large body of literature on the general challenges of and best practices for trustworthy online A/B testing, few studies address sample size estimation, which plays a crucial role in trustworthy and efficient A/B testing by ensuring that the resulting inference has sufficient power and type I error control. For example, when the sample size is underestimated, the statistical inference, even with correct analysis methods, may fail to detect a true improvement, leading to misinformed and costly decisions. This paper addresses this fundamental gap by developing new sample size calculation methods for correlated data and for absolute vs. relative treatment effects, both of which are ubiquitous in online experiments. Additionally, we address a practical question: what is the minimal observed difference that will be statistically significant, and how does it relate to the average treatment effect and to sample size calculation? All proposed methods are accompanied by mathematical proofs, illustrative examples, and simulations. We end by sharing best practices on various practical topics in sample size calculation and experimental design.
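To make the relationship between sample size and the minimal significant observed difference concrete, the following is a minimal sketch based on the textbook two-sided, two-sample z-test with independent observations and a common known standard deviation. It is not the method proposed in this paper, which targets correlated data and relative effects; the function names `sample_size_per_group` and `minimal_significant_difference` and the input values are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Per-group n for a two-sided, two-sample z-test (textbook formula).

    Assumes independent observations and a common standard deviation sigma.
    delta is the absolute treatment effect the test should be able to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


def minimal_significant_difference(n, sigma, alpha=0.05):
    """Smallest observed absolute difference in means that is significant
    at level alpha when each group has n independent observations."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return z_alpha * sigma * math.sqrt(2 / n)


# Illustrative inputs (assumptions, not values from the paper).
n = sample_size_per_group(delta=0.1, sigma=1.0)
msd = minimal_significant_difference(n, sigma=1.0)
print(n, round(msd, 3))
```

Under these assumptions, the smallest observed difference that reaches significance is roughly 70% of the effect used to size the experiment (the ratio is z_{1-α/2} / (z_{1-α/2} + z_{power})), so an observed difference can be significant even when it is smaller than the planned treatment effect.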