We study A/B experiments that are designed to compare the performance of two recommendation algorithms. Prior work has observed that the stable unit treatment value assumption (SUTVA) often does not hold in large-scale recommendation systems, and hence the estimate of the global treatment effect (GTE) is biased. Specifically, units under the treatment and control algorithms contribute to a shared pool of data that subsequently trains both algorithms, resulting in interference between the two groups. In this paper, we investigate when such interference may affect our decision making on which algorithm is better. We formalize this insight under a multi-armed bandit framework and theoretically characterize when the sign of the difference-in-means estimator of the GTE under data sharing aligns with or contradicts the sign of the true GTE. Our analysis identifies the level of exploration versus exploitation as a key determinant of how data sharing impacts decision making, and we propose a detection procedure based on ramp-up experiments to flag incorrect algorithm comparisons in practice.
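To make the setup concrete, the following is a minimal simulation sketch of the data-sharing interference described above, under assumptions not taken from the paper: two Bernoulli arms, treatment and control implemented as epsilon-greedy policies that differ only in their exploration rate, units assigned alternately, and both policies updating from one shared pool of pulls. The paper's exact model may differ; this only illustrates how the difference-in-means estimator of the GTE is computed under data sharing.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_shared_experiment(p_arms, eps_treat, eps_ctrl, T):
    """Hypothetical A/B test between two epsilon-greedy bandit algorithms
    that train on a shared data pool (a sketch, not the paper's model)."""
    n = np.zeros(len(p_arms))  # shared pull counts per arm
    s = np.zeros(len(p_arms))  # shared reward sums per arm
    rewards = {"treat": [], "ctrl": []}
    for t in range(T):
        group = "treat" if t % 2 == 0 else "ctrl"
        eps = eps_treat if group == "treat" else eps_ctrl
        # Both groups act on the SAME estimated means -- this is the
        # data-sharing interference: each group's pulls train both policies.
        means = np.where(n > 0, s / np.maximum(n, 1), 1.0)  # optimistic init
        if rng.random() < eps:
            arm = int(rng.integers(len(p_arms)))  # explore
        else:
            arm = int(np.argmax(means))           # exploit
        r = float(rng.random() < p_arms[arm])     # Bernoulli reward
        n[arm] += 1
        s[arm] += r
        rewards[group].append(r)
    # Difference-in-means estimator of the GTE under data sharing.
    return float(np.mean(rewards["treat"]) - np.mean(rewards["ctrl"]))

est = run_shared_experiment([0.3, 0.7], eps_treat=0.05, eps_ctrl=0.5, T=10000)
```

In this toy run the less-exploratory treatment free-rides on the shared reward estimates, so the estimator reflects a mixture of the algorithms' intrinsic quality and the exploration they contribute to the common pool, which is the source of the sign-flipping behavior the analysis characterizes.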