The short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A well-known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. One way to estimate the long-term outcome is to run an online experiment (A/B test) on the candidate algorithms, but observing the long-term outcomes of interest takes months or even longer, which makes the algorithm selection process unacceptably slow. This work therefore studies the problem of feasibly yet accurately estimating the long-term outcome of an algorithm using only historical and short-term experiment data. Existing approaches to this problem either rely on surrogacy, a restrictive assumption about the short-term outcomes, or cannot use short-term outcomes effectively, which makes them inefficient. We propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition. LOPE works under a more relaxed assumption than surrogacy and leverages short-term rewards to substantially reduce the variance. Synthetic experiments show that LOPE outperforms existing approaches, particularly when surrogacy is severely violated and the long-term reward is noisy. In addition, real-world experiments on large-scale A/B test data collected on a music streaming platform show that LOPE estimates the long-term outcome of actual algorithms more accurately than existing feasible methods.
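To make the click-bait example concrete, the following toy simulation sketches how an estimate that relies on surrogacy can go wrong. It is not the LOPE estimator, and it does not reproduce any experiment from the paper. It assumes that surrogacy means the algorithm affects the long-term outcome only through the short-term outcome, and it deliberately violates that assumption. The data-generating model, the coefficients, and the names `simulate` and `fit_line` are all hypothetical and chosen only for illustration.

```python
import random


def simulate(action, n, rng):
    """Draw (short_term, long_term) pairs from a hypothetical toy model.

    action = 0 is the baseline algorithm, action = 1 is a click-bait one.
    """
    data = []
    for _ in range(n):
        # Assumption: click-bait raises short-term clicks by 1.0 on average.
        s = rng.gauss(1.0 + 1.0 * action, 1.0)
        # Assumption: click-bait also harms the long-term outcome directly,
        # not through s, so surrogacy is violated by construction.
        r = 0.5 * s - 1.0 * action + rng.gauss(0.0, 1.0)
        data.append((s, r))
    return data


def fit_line(pairs):
    """Ordinary least squares fit of long-term r on short-term s."""
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_r = sum(r for _, r in pairs) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in pairs)
    var = sum((s - mean_s) ** 2 for s, _ in pairs)
    slope = cov / var
    return slope, mean_r - slope * mean_s


rng = random.Random(0)

# Historical data: both outcomes observed, baseline algorithm only.
historical = simulate(0, 20000, rng)
# Short-term experiment: only the short-term outcome of the new algorithm.
short_exp = [s for s, _ in simulate(1, 20000, rng)]

# Surrogacy-style estimate: learn r from s on historical data, then apply
# that fit to the new algorithm's short-term outcomes.
slope, intercept = fit_line(historical)
surrogate_estimate = sum(intercept + slope * s for s in short_exp) / len(short_exp)

# Ground truth implied by the toy model's own coefficients.
baseline_value = 0.5 * 1.0            # expected long-term outcome, baseline
true_value = 0.5 * 2.0 - 1.0          # expected long-term outcome, click-bait

print(f"surrogacy-style estimate for click-bait: {surrogate_estimate:.3f}")
print(f"true long-term value for click-bait:     {true_value:.3f}")
print(f"true long-term value for baseline:       {baseline_value:.3f}")
```

In this toy model the surrogacy-style estimate ranks the click-bait algorithm above the baseline, while the model's true long-term values rank it below. This is the kind of failure that motivates working under an assumption weaker than surrogacy.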