When designing an online A/B experiment, it is crucial to select a sample size and duration such that the resulting confidence interval (CI) for the treatment effect is narrow enough to detect an effect of meaningful magnitude with sufficient statistical power, without wasting resources on a larger or longer experiment than necessary. While the relationship between sample size and CI width is well understood, the effect of experiment duration on CI width remains less clear. This paper provides an analytical formula for the width of a CI based on a ratio treatment effect estimator as a function of both sample size (N) and duration (T). The formula is derived from a mixed effects model with two variance components. One component, referred to as the temporal variance, persists over time in experiments where the same users are kept in the same experiment arm across days. The remaining error variance component, by contrast, decays to zero as T grows large. The formula introduces a key parameter that we call the user-specific temporal correlation (UTC), which quantifies the relative sizes of the two variance components and can be estimated from historical experiments. A higher UTC indicates a slower decay in CI width over time. On the other hand, when the UTC is 0, as in experiments where users shuffle in and out of the experiment across days, the CI width decays at the standard parametric 1/T rate. We also study how access to pre-period data for the users in the experiment affects the CI width decay. We show that our formula closely explains the CI widths observed in real A/B experiments at YouTube.