Auction-based recommender systems are prevalent in online advertising platforms, but they are typically optimized to allocate recommendation slots based on immediate expected return metrics, neglecting the downstream effects of recommendations on user behavior. In this study, we employ reinforcement learning to optimize for long-term return metrics in an auction-based recommender system. Using temporal difference learning, a fundamental reinforcement learning algorithm, we implement a one-step policy improvement approach that biases the system towards recommendations with higher long-term user engagement metrics. This optimizes value over long horizons while maintaining compatibility with the auction framework. Our approach is grounded in dynamic programming, from which it follows that the method provably improves upon the existing auction-based base policy. Through an online A/B test conducted on an auction-based recommender system that handles billions of impressions and users daily, we empirically establish that our proposed method outperforms the current production system in terms of long-term user engagement metrics.
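As a rough illustration of the idea, the following minimal sketch evaluates a base policy with tabular TD(0) updates on logged transitions and then re-ranks candidates by their immediate bid plus a weighted long-term value estimate. It is not the system described above: the tabular representation, the toy user states, the function names `td0_evaluate` and `rank_candidates`, and the parameters `gamma`, `alpha`, and `weight` are all assumptions made for the example.

```python
from collections import defaultdict


def td0_evaluate(transitions, gamma=0.9, alpha=0.1, epochs=200):
    """Estimate Q(s, a) under the logged base policy with TD(0) updates.

    transitions: list of (state, action, reward, next_state, next_action);
    next_state is None when the session ends. (Hypothetical log format.)
    """
    q = defaultdict(float)
    for _ in range(epochs):
        for s, a, r, s_next, a_next in transitions:
            if s_next is None:
                target = r  # terminal step: no future value to bootstrap from
            else:
                target = r + gamma * q[(s_next, a_next)]
            q[(s, a)] += alpha * (target - q[(s, a)])
    return q


def rank_candidates(state, candidates, q, weight=1.0):
    """One-step improvement: order candidates by bid + weight * Q(state, item).

    candidates: list of (item_id, immediate_bid). With weight=0 this reduces
    to the plain bid-ordered ranking of the base policy.
    """
    def score(candidate):
        item_id, bid = candidate
        return bid + weight * q.get((state, item_id), 0.0)

    return sorted(candidates, key=score, reverse=True)


# Toy logged data (invented for illustration): "clickbait" pays off once and
# ends the session, while "quality" leads to a further engaged step.
logged = [
    ("new", "clickbait", 1.0, None, None),
    ("new", "quality", 0.5, "engaged", "quality"),
    ("engaged", "quality", 1.0, None, None),
]
q = td0_evaluate(logged)

candidates = [("clickbait", 1.0), ("quality", 0.8)]
myopic = rank_candidates("new", candidates, q, weight=0.0)
improved = rank_candidates("new", candidates, q, weight=1.0)
print([item for item, _ in myopic])
print([item for item, _ in improved])
```

Because the long-term term is simply added to each candidate's score, the ranking step keeps the shape of an auction. The `weight` parameter controls how strongly the ordering is biased towards long-term value.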