Attention-based sequential recommendation methods have shown promising results by accurately capturing users' dynamic interests from historical interactions. Beyond generating better user representations, recent studies have begun integrating reinforcement learning (RL) into these models. Framing sequential recommendation as an RL problem with reward signals enables recommender systems (RS) that account for a vital aspect: incorporating direct user feedback, in the form of rewards, to deliver a more personalized experience. Nonetheless, employing RL algorithms presents challenges, including off-policy training, expansive combinatorial action spaces, and the scarcity of datasets with sufficient reward signals. Contemporary approaches have attempted to combine RL and sequential modeling, incorporating contrastive-based objectives and negative sampling strategies for training the RL component. In this study, we further demonstrate the efficacy of contrastive-based objectives paired with augmentation for datasets with extended horizons. We also identify instability issues that can arise when negative sampling is applied. These issues stem primarily from the data imbalance prevalent in real-world datasets, a common problem in offline RL settings. While our established baselines attempt to mitigate this through various techniques, instability remains an issue. We therefore introduce an enhanced method intended to address these challenges more effectively.