Recommender systems are a ubiquitous feature of online platforms. Increasingly, they are explicitly tasked with improving users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a multi-armed bandit problem with delayed rewards. We observe that there is an apparent trade-off in choosing the learning signal: waiting for the full reward to become available might take several weeks, slowing the rate at which learning happens, whereas short-term proxy rewards reflect the actual long-term goal only imperfectly. We address this challenge in two steps. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Full observations as well as partial (short- or medium-term) outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that takes advantage of this new predictive model. The algorithm quickly learns to identify content aligned with long-term success by carefully balancing exploration and exploitation. We apply our approach to a podcast recommendation problem, where we seek to identify shows that users engage with repeatedly over two months. We empirically validate that our approach results in substantially better performance than approaches that either optimize for short-term proxies or wait for the long-term outcome to be fully realized.
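The two-step idea described above (a probabilistic belief built from partial and full observations, then a bandit policy that acts on it) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the class `ArmBelief`, the function `thompson_select`, the Gaussian noise model, and the show names are all hypothetical. A precision-weighted Gaussian update stands in for the Bayesian filter, and Thompson sampling stands in for the bandit algorithm. Partial observations are modeled simply as noisier estimates of the long-term reward, and a later observation of the same trial replaces the earlier one.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ArmBelief:
    """Gaussian belief over one arm's mean long-term reward (hypothetical model)."""
    prior_mean: float = 0.0
    prior_var: float = 1.0
    # trial_id -> (latest estimate of the long-term reward, its noise variance)
    obs: dict = field(default_factory=dict)

    def observe(self, trial_id, estimate, noise_var):
        # A later, more complete observation of the same trial
        # replaces the earlier partial one instead of being double counted.
        self.obs[trial_id] = (estimate, noise_var)

    def posterior(self):
        # Precision-weighted Gaussian update: noisy (partial) observations
        # move the belief less than precise (fully realized) ones.
        precision = 1.0 / self.prior_var
        weighted = self.prior_mean / self.prior_var
        for estimate, noise_var in self.obs.values():
            precision += 1.0 / noise_var
            weighted += estimate / noise_var
        return weighted / precision, 1.0 / precision


def thompson_select(beliefs, rng):
    """Draw one plausible mean per arm from its posterior and pick the largest."""
    best_arm, best_draw = None, float("-inf")
    for arm, belief in beliefs.items():
        mean, var = belief.posterior()
        draw = rng.gauss(mean, var ** 0.5)
        if draw > best_draw:
            best_arm, best_draw = arm, draw
    return best_arm


# Usage with made-up numbers.
rng = random.Random(0)
beliefs = {"show_a": ArmBelief(), "show_b": ArmBelief()}
beliefs["show_a"].observe("t1", estimate=0.8, noise_var=1.0)  # early, partial signal
beliefs["show_a"].observe("t1", estimate=0.6, noise_var=0.1)  # same trial, fully realized
beliefs["show_b"].observe("t2", estimate=0.2, noise_var=1.0)  # still partial
chosen = thompson_select(beliefs, rng)
```

In this sketch an arm with only partial observations keeps a wide posterior, so it continues to be explored, while an arm with fully realized outcomes is judged mostly on those. A real system would need a learned mapping from short-term signals to long-term reward rather than the fixed noise variances assumed here.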