Online controlled experiments, such as A/B-tests, are commonly used by modern tech companies to enable continuous system improvements. Despite their paramount importance, A/B-tests are expensive: by their very definition, a percentage of traffic is assigned to an inferior system variant. To ensure statistical significance on top-level metrics, online experiments typically run for several weeks. Even then, a considerable number of experiments lead to inconclusive results (i.e., false negatives, or type-II errors). The main culprit for this inefficiency is the variance of the online metrics. Variance reduction techniques have been proposed in the literature, but their direct applicability to commonly used ratio metrics (e.g., click-through rate or user retention) is limited. In this work, we successfully apply variance reduction techniques to ratio metrics on a large-scale short-video platform: ShareChat. Our empirical results show that we can either improve A/B-test confidence in 77% of cases or retain the same level of confidence with 30% fewer data points. Importantly, we show that the common approach of including as many covariates as possible in regression is counter-productive, and that control variates based on Gradient-Boosted Decision Tree predictors are most effective. We discuss the practicalities of implementing these methods at scale and showcase the cost reduction they deliver.
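As a rough illustration of what variance reduction on a ratio metric can look like, the sketch below linearizes a click-through-rate style metric with the delta method and then applies a single control variate built from pre-experiment data. This is a generic textbook construction on synthetic data, not the pipeline used in this work: the helper names (`linearize_ratio`, `control_variate_adjust`), the data-generating process, and the choice of covariate are all assumptions made for the example. In particular, the covariate here is simply the pre-period version of the same metric, rather than the output of a Gradient-Boosted Decision Tree predictor.

```python
import numpy as np


def linearize_ratio(num, den):
    """Delta-method linearization of sum(num) / sum(den).

    Returns one value per user whose sample mean equals the ratio metric,
    so that standard mean-based tools (t-tests, control variates) apply.
    """
    ratio = num.sum() / den.sum()
    return ratio + (num - ratio * den) / den.mean()


def control_variate_adjust(y, x):
    """Return y - theta * (x - mean(x)) with the variance-minimizing theta."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())


# Synthetic data (assumption): each user has a latent click propensity that
# drives behaviour both before and during the experiment.
rng = np.random.default_rng(0)
n_users = 10_000
propensity = rng.beta(2, 8, size=n_users)

views_pre = rng.poisson(20, size=n_users) + 1
clicks_pre = rng.binomial(views_pre, propensity)
views = rng.poisson(20, size=n_users) + 1
clicks = rng.binomial(views, propensity)

# Linearize the in-experiment metric and the pre-experiment covariate.
y = linearize_ratio(clicks, views)
x = linearize_ratio(clicks_pre, views_pre)

# Adjust the metric using the pre-experiment covariate.
y_cv = control_variate_adjust(y, x)

var_raw = y.var(ddof=1)
var_cv = y_cv.var(ddof=1)
print(f"CTR estimate (raw):      {y.mean():.4f}")
print(f"CTR estimate (adjusted): {y_cv.mean():.4f}")
print(f"variance reduction:      {1 - var_cv / var_raw:.1%}")
```

The adjustment leaves the point estimate unchanged while shrinking its variance by roughly the squared correlation between `y` and `x`. In an actual A/B-test, `theta` would typically be estimated once on the pooled data and applied to both arms, and the covariate must be unaffected by the treatment (for example, measured before assignment).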