In industry, online randomized controlled experiments (a.k.a. A/B experiments) are the standard approach to measuring the impact of a causal change. These experiments are typically run with small treatment effects to limit the potential blast radius. As a result, they often fail to reach statistical significance because of a low signal-to-noise ratio. To improve precision (i.e., reduce the standard error), we introduce the idea of trigger observations: observations for which the outputs of the treatment and control models differ. We show that evaluation with full information about trigger observations (full knowledge) improves precision relative to a baseline method. However, detecting all such trigger observations is costly, so we propose a sampling-based evaluation method (partial knowledge) to reduce that cost. The randomness of sampling introduces bias into the estimated outcome. We analyze this bias theoretically and show that it is inversely proportional to the number of observations used for sampling. We also compare the proposed evaluation methods on simulated and empirical data. In simulation, evaluation with full knowledge reduces the standard error by as much as 85%. In the empirical setup, evaluation with partial knowledge reduces the standard error by 36.48%.
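To make the intuition concrete, here is a minimal simulation sketch of why restricting analysis to trigger observations shrinks the standard error. All parameters (trigger rate, effect size, noise level) are hypothetical and not taken from the paper; the trigger flag is shared across arms for simplicity, whereas in practice it would come from counterfactually logging both model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation parameters (illustrative only).
n = 100_000          # users per arm
trigger_rate = 0.1   # fraction of users where treatment and control models differ
effect = 0.5         # treatment effect, applied only to triggered users
noise_sd = 10.0      # outcome noise

triggered = rng.random(n) < trigger_rate
control = rng.normal(0.0, noise_sd, n)
treatment = rng.normal(0.0, noise_sd, n) + effect * triggered

# Baseline (all-up) estimate: difference in means over all observations.
all_up_effect = treatment.mean() - control.mean()
all_up_se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)

# Full-knowledge trigger analysis: restrict to triggered observations,
# then scale back to the all-up scale by the trigger rate (dilution).
t_trig = treatment[triggered]
c_trig = control[triggered]
trig_effect = t_trig.mean() - c_trig.mean()
trig_se = np.sqrt(t_trig.var(ddof=1) / len(t_trig)
                  + c_trig.var(ddof=1) / len(c_trig))
diluted_effect = trig_effect * triggered.mean()
diluted_se = trig_se * triggered.mean()

print(f"all-up:    effect={all_up_effect:.3f}  SE={all_up_se:.4f}")
print(f"triggered: effect={diluted_effect:.3f}  SE={diluted_se:.4f}")
```

With a 10% trigger rate, the untriggered 90% of observations contribute only noise to the all-up difference in means, so the trigger-restricted estimate (after dilution scaling) has a markedly smaller standard error for the same total sample size.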