We study A/B testing, the standard protocol for measuring the performance gain of a new decision system relative to a baseline. Traditional A/B testing treats both systems as black boxes and ignores any similarity between them. In practice, however, the new and baseline systems are rarely radically different: they often share significant structure, which can be captured by their propensities to make similar decisions. We show that in such cases the commonly used difference-in-means estimator, though unbiased, is statistically suboptimal. Leveraging off-policy estimation, we introduce a family of A/B testing estimators that exploit the propensities of the tested systems to achieve improved concentration properties. This family is flexible enough to be tailored to practical decision-making settings. The resulting estimators are simple, robust to propensity misspecification, substantially more accurate when the tested systems are similar, and fall back gracefully to the difference-in-means estimator when they are not. Our theoretical analysis and empirical studies confirm their efficiency and practicality.
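To make the contrast concrete, the sketch below compares the difference-in-means estimator with one textbook off-policy construction: pool both arms, treat the 50/50 mixture of the two systems as the logging policy, and weight each reward by the propensity difference divided by the mixture propensity. This is an illustration under stated assumptions, not the estimator family proposed in the paper. The function names, the three-action setup, the Bernoulli rewards, and the balanced alternating assignment are all hypothetical choices made for the example.

```python
import random
from statistics import mean, pvariance


def simulate_ab(pi_base, pi_new, mean_reward, n, rng):
    """Alternate users between the two arms (balanced split, n assumed even).
    Each arm samples an action from its policy and observes a Bernoulli reward."""
    actions = range(len(mean_reward))
    data = []
    for i in range(n):
        arm = i % 2  # 0 = baseline, 1 = new system
        policy = pi_new if arm else pi_base
        a = rng.choices(actions, weights=policy)[0]
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        data.append((arm, a, r))
    return data


def difference_in_means(data):
    """Standard A/B estimate: mean reward of the new arm minus the baseline arm."""
    new = [r for arm, _, r in data if arm == 1]
    base = [r for arm, _, r in data if arm == 0]
    return mean(new) - mean(base)


def propensity_weighted(data, pi_base, pi_new):
    """Illustrative off-policy estimate of the gain (assumption: not the paper's
    estimator). Pools both arms and uses the 50/50 mixture as logging policy."""
    total = 0.0
    for _, a, r in data:
        # mix > 0 for every observed action, since it was drawn from one policy
        mix = 0.5 * pi_base[a] + 0.5 * pi_new[a]
        total += (pi_new[a] - pi_base[a]) / mix * r
    return total / len(data)


def compare_variances(reps, n, seed):
    """Empirical variance of both estimators over repeated simulated tests
    with two similar (hypothetical) policies."""
    rng = random.Random(seed)
    pi_base = [0.50, 0.30, 0.20]
    pi_new = [0.45, 0.30, 0.25]
    mean_reward = [0.2, 0.5, 0.8]
    dim, pw = [], []
    for _ in range(reps):
        data = simulate_ab(pi_base, pi_new, mean_reward, n, rng)
        dim.append(difference_in_means(data))
        pw.append(propensity_weighted(data, pi_base, pi_new))
    return pvariance(dim), pvariance(pw)


if __name__ == "__main__":
    var_dim, var_pw = compare_variances(reps=200, n=2000, seed=0)
    print("difference-in-means variance:", var_dim)
    print("propensity-weighted variance:", var_pw)
```

When the two policies are similar, the weights are close to zero for most actions, so the weighted estimate fluctuates much less than the difference in means. When the policies have disjoint support, every weight is +2 or -2 and, with a balanced split, this particular construction coincides with the difference-in-means estimate, which mirrors the fallback behavior described in the abstract.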