Auction-based recommender systems are prevalent in online advertising platforms, but they are typically optimized to allocate recommendation slots based on immediate expected return metrics, neglecting the downstream effects of recommendations on user behavior. In this study, we employ reinforcement learning to optimize for long-term return metrics in an auction-based recommender system. Using temporal difference learning, a fundamental reinforcement learning algorithm, we implement a one-step policy improvement approach that biases the system towards recommendations with higher long-term user engagement metrics. This optimizes value over long horizons while maintaining compatibility with the auction framework. Our approach is grounded in dynamic programming ideas, which show that it provably improves upon the existing auction-based base policy. Through an online A/B test conducted on an auction-based recommender system that handles billions of impressions and users daily, we empirically establish that our proposed method outperforms the current production system in terms of long-term user engagement metrics.
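To make the idea concrete, the following is a minimal tabular sketch of one-step policy improvement on top of an auction ranking. It is not the production method described above, whose details the abstract does not give. Everything in it is an illustrative assumption: the function names `td0_evaluate` and `rank_with_long_term_bias`, the toy states and candidates, the additive scoring rule `bid + weight * Q`, and the hyperparameters. A SARSA-style TD(0) update estimates the long-term engagement value of the base policy from logged transitions, and that estimate is then added as a bias to each candidate's immediate auction score.

```python
from collections import defaultdict


def td0_evaluate(transitions, alpha=0.1, gamma=0.9, sweeps=200):
    """Estimate Q(state, action) for the policy that produced the logged data.

    Each transition is (state, action, reward, next_state, next_action);
    next_state is None when the episode ends. The reward here is a
    hypothetical engagement signal, not the bid.
    """
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, a_next in transitions:
            # TD(0) target: observed reward plus discounted value of the
            # next state-action pair actually taken by the base policy.
            target = r if s_next is None else r + gamma * q[(s_next, a_next)]
            q[(s, a)] += alpha * (target - q[(s, a)])
    return q


def rank_with_long_term_bias(state, bids, q, weight=1.0):
    """Pick the candidate maximizing bid + weight * Q(state, candidate).

    With weight=0 this reduces to the myopic auction (highest bid wins).
    The additive form is an assumption made for illustration.
    """
    return max(bids, key=lambda a: bids[a] + weight * q[(state, a)])


# Toy logged data (hypothetical): showing "a" to a new user ends the
# session with no engagement; showing "b" leads to an engaged state.
logged = [
    ("new", "a", 0.0, None, None),
    ("new", "b", 1.0, "engaged", "a"),
    ("engaged", "a", 1.0, None, None),
]
bids = {"a": 1.0, "b": 0.8}  # immediate expected return per candidate

q = td0_evaluate(logged)
myopic_choice = rank_with_long_term_bias("new", bids, q, weight=0.0)
long_term_choice = rank_with_long_term_bias("new", bids, q, weight=1.0)
print(myopic_choice, long_term_choice)
```

In this toy setup the myopic auction favors the candidate with the higher bid, while the biased ranking can prefer a lower-bid candidate whose estimated long-term engagement is higher. Because the bias enters only as an additive term on the auction score, the ranking step itself is unchanged, which is one simple way to remain compatible with an existing auction.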