While there is a large body of literature on the general challenges of, and best practices for, trustworthy online A/B testing, few studies address sample size estimation, which is crucial to trustworthy and efficient A/B testing because it ensures that the resulting inference has sufficient power and type I error control. For example, when the sample size is underestimated, the statistical inference, even with the correct analysis methods, is unlikely to detect a true improvement, leading to misinformed and costly decisions. This paper addresses this fundamental gap by developing new sample size calculation methods for correlated data and for absolute versus relative treatment effects, both of which are ubiquitous in online experiments. Additionally, we address a practical question: what is the minimal observed difference that will be statistically significant, and how does it relate to the average treatment effect and sample size calculation? All proposed methods are accompanied by mathematical proofs, illustrative examples, and simulations. We end by sharing best practices on various practical topics in sample size calculation and experimental design.
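To make the quantities mentioned above concrete, the following is a minimal sketch of the textbook sample size calculation for a two-sided, two-sample z-test, together with the smallest observed difference that reaches significance at that sample size. It assumes independent observations, equal group sizes, and a known common standard deviation. It is not the method developed in this paper for correlated data or relative effects, and the function names `sample_size_per_group` and `min_significant_difference` are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Per-group n for a two-sided two-sample z-test (textbook formula).

    Assumes independent observations with a common, known standard
    deviation `sigma`; `delta` is the absolute treatment effect to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)


def min_significant_difference(n, sigma, alpha=0.05):
    """Smallest absolute observed difference in means that is significant.

    The standard error of the difference between two group means of
    size n each is sigma * sqrt(2 / n) under the same assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return z_alpha * sigma * math.sqrt(2 / n)


# Illustrative inputs (assumed, not taken from the paper).
n = sample_size_per_group(delta=0.1, sigma=1.0)
threshold = min_significant_difference(n, sigma=1.0)
print(n, round(threshold, 3))
```

Under these assumptions, the significance threshold is smaller than the effect size used for planning, by a factor of roughly z_{1-α/2} / (z_{1-α/2} + z_{power}), which is about 0.7 at α = 0.05 and 80% power. An observed difference can therefore be statistically significant while falling short of the planned effect.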