Leveraging covariate adjustments at scale in online A/B testing

Companies offering web services routinely run randomized online experiments to estimate the causal impact of new features and policies on key performance metrics. These experiments are used to estimate a variety of effects, such as the increase in click rate due to the repositioning of a banner or the change in subscription rate following a discount or special offer. In these settings, even very small effects can have large downstream impacts. The simple difference-in-means estimator (Splawa-Neyman et al., 1990) is still the standard choice for many online A/B testing platforms because of its simplicity. This method, however, can fail to detect small effects, even when the experiment contains thousands or millions of observational units. As a by-product of these experiments, large amounts of additional data (covariates) are also collected. In this paper, we discuss the benefits, costs and risks of allowing experimenters to leverage more complicated estimators that make use of covariates when estimating causal effects of interest. We adapt a recently proposed general-purpose algorithm for the estimation of causal effects with covariates to the setting of online A/B tests. Through this paradigm, we implement several covariate-adjusted causal estimators. We thoroughly evaluate their performance at scale, highlighting the benefits and shortcomings of different methods. We show on real experiments how "covariate-adjusted" estimators can (i) lead to more precise quantification of the causal effects of interest and (ii) fix issues related to imbalance across treatment arms, a practical concern often overlooked in the literature. In turn, (iii) these more precise estimates can reduce experimentation time, cutting costs and streamlining decision-making, which allows for faster adoption of beneficial interventions.
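
To make the contrast between the difference-in-means estimator and a covariate-adjusted estimator concrete, the following is a minimal sketch on synthetic data. It is not the general-purpose algorithm adapted in the paper: it swaps in a plain per-arm linear regression adjustment on a single pre-experiment covariate, and the sample size, effect size, noise model, and function names are all assumptions chosen for illustration. The standard error of the adjusted estimate is a large-sample approximation that ignores the uncertainty in the fitted slopes.

```python
import numpy as np


def difference_in_means(y, t):
    """Return (estimate, standard error) for the unadjusted estimator."""
    y1, y0 = y[t == 1], y[t == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return est, se


def regression_adjusted(y, t, x):
    """Return (estimate, approximate standard error) after adjusting for x.

    A least-squares slope of y on x is fitted separately in each arm, and
    the part of each outcome explained by the centered covariate is removed
    before the two arms are compared.
    """
    x_c = x - x.mean()  # center on the pooled covariate mean
    adjusted = np.empty_like(y, dtype=float)
    for arm in (0, 1):
        mask = t == arm
        slope = np.cov(x_c[mask], y[mask], ddof=1)[0, 1] / x_c[mask].var(ddof=1)
        adjusted[mask] = y[mask] - slope * x_c[mask]
    return difference_in_means(adjusted, t)


# Synthetic experiment (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
n = 20_000
true_effect = 0.05
x = rng.normal(size=n)            # pre-experiment covariate
t = rng.integers(0, 2, size=n)    # randomized assignment: 0 = control, 1 = treatment
y = true_effect * t + 2.0 * x + rng.normal(size=n)

est_raw, se_raw = difference_in_means(y, t)
est_adj, se_adj = regression_adjusted(y, t, x)
print(f"difference in means: {est_raw:.4f} (SE {se_raw:.4f})")
print(f"covariate-adjusted:  {est_adj:.4f} (SE {se_adj:.4f})")
```

In this simulated setting the covariate explains most of the outcome variance, so the adjusted estimate has a visibly smaller standard error at the same sample size. On real experiments the gain depends on how predictive the available covariates are.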
