Reinforcement Learning (RL) algorithms have received growing attention for optimizing long-term user engagement in sequential recommendation tasks. One challenge in large-scale online recommendation systems is that users' behavior patterns, such as interaction rates and retention tendencies, change constantly and in complicated ways. When the recommendation system is formulated as a Markov Decision Process (MDP), its dynamics and reward functions are continuously affected by these changes. Existing RL algorithms for recommendation systems suffer from distribution shift and struggle to adapt in such an MDP. In this paper, we introduce a novel paradigm called Adaptive Sequential Recommendation (AdaRec) to address this issue. AdaRec uses a new distance-based representation loss to extract latent information from users' interaction trajectories. This information reflects how well the RL policy fits current user behavior patterns and helps the policy identify subtle changes in the recommendation system. To adapt rapidly to these changes, AdaRec encourages exploration following the principle of optimism under uncertainty. The exploration is further guarded by zero-order action optimization to ensure stable recommendation quality in complicated environments. We conduct extensive empirical analyses in both simulator-based and live sequential recommendation tasks, where AdaRec exhibits superior long-term performance compared to all baseline algorithms.