Attention-based sequential recommendation methods have shown promise in accurately capturing users' evolving interests from their past interactions. Beyond learning better user representations, recent research has explored integrating reinforcement learning (RL) into these models. Framing sequential recommendation as an RL problem with reward signals makes it possible to build recommender systems that incorporate direct user feedback as rewards, which improves personalization. Nonetheless, applying RL algorithms presents challenges, including off-policy training, large combinatorial action spaces, and the scarcity of datasets with sufficient reward signals. Contemporary approaches combine RL and sequential modeling by using contrastive objectives and negative sampling strategies to train the RL component. In this work, we further highlight the efficacy of contrastive objectives paired with augmentation on datasets with extended horizons. We also identify instability issues that can arise when negative sampling is applied. These issues stem primarily from the data imbalance prevalent in real-world datasets, a common problem in offline RL. Finally, we introduce an enhanced method that addresses these challenges more effectively. Experimental results on several real-world datasets show that our method achieves increased robustness and state-of-the-art performance.
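To make the RL framing concrete, the following is a minimal sketch of how one logged user session might be converted into offline RL transitions: the state is the recent interaction history, the action is the next item, and the reward depends on the feedback type. The reward values, the padding id, and the names `REWARDS`, `pad_state`, and `session_to_transitions` are illustrative assumptions, not details taken from this work.

```python
from typing import List, Tuple

# Hypothetical reward values per feedback type; the abstract specifies none.
REWARDS = {"click": 1.0, "purchase": 5.0}
PAD = 0  # assumed padding item id, reserved so it never collides with a real item

Transition = Tuple[List[int], int, float, List[int], bool]


def pad_state(history: List[int], max_len: int) -> List[int]:
    """Keep the most recent max_len items and left-pad with PAD."""
    recent = history[-max_len:]
    return [PAD] * (max_len - len(recent)) + recent


def session_to_transitions(
    items: List[int], events: List[str], max_len: int = 3
) -> List[Transition]:
    """Turn one logged session into (state, action, reward, next_state, done) tuples."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(items) != len(events):
        raise ValueError("items and events must have the same length")
    transitions = []
    for t in range(1, len(items)):
        state = pad_state(items[:t], max_len)           # history before step t
        action = items[t]                               # item the user interacted with
        reward = REWARDS[events[t]]                     # feedback on that item
        next_state = pad_state(items[: t + 1], max_len)
        done = t == len(items) - 1                      # last logged interaction
        transitions.append((state, action, reward, next_state, done))
    return transitions


if __name__ == "__main__":
    for tr in session_to_transitions([10, 20, 30], ["click", "click", "purchase"]):
        print(tr)
```

Because every transition comes from logged behavior rather than from the policy being trained, training on such tuples is off-policy, and the action space is the full item catalog. Sparse feedback types such as purchases also illustrate why reward signals can be scarce and imbalanced in real logs.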