Online experiments such as Randomised Controlled Trials (RCTs) or A/B-tests are the bread and butter of modern platforms on the web. They are conducted continuously to allow platforms to estimate the causal effect of replacing system variant "A" with variant "B" on some metric of interest. These variants can differ in many aspects. In this paper, we focus on the common use-case where they correspond to machine learning models. The online experiment then serves as the final arbiter that decides which model is superior and should thus be shipped. The statistical literature on causal effect estimation from RCTs has a substantial history, which contributes deservedly to the level of trust researchers and practitioners have in this "gold standard" of evaluation practices. Nevertheless, in the particular case of machine learning experiments, we remark that certain critical issues remain. Specifically, the assumptions required to ensure that A/B-tests yield unbiased estimates of the causal effect are seldom met in practical applications. We argue that, because variants typically learn from data pooled across variants, the absence of interference between models cannot be guaranteed. This undermines the conclusions we can draw from online experiments with machine learning models. We discuss the implications this has for practitioners and for the research literature.