Reinforcement learning (RL)-based recommender systems have shown promising performance in meeting user expectations by learning to make accurate next-item recommendations from historical user-item interactions. However, existing offline RL-based sequential recommendation methods struggle to obtain effective user feedback from the environment: modelling the user state effectively and shaping an appropriate reward for recommendation both remain challenging. In this paper, we leverage the language understanding capabilities of large language models (LLMs) and adapt them as an environment (LE) to enhance RL-based recommenders. The LE is learned from a subset of user-item interaction data, thus reducing the need for large training data, and can synthesise user feedback for offline data by: (i) acting as a state model that produces high-quality states that enrich the user representation, and (ii) functioning as a reward model that accurately captures nuanced user preferences on actions. Moreover, the LE can generate positive actions that augment the limited offline training data. We propose an LE Augmentation (LEA) method that further improves recommendation performance by jointly optimising the supervised component and the RL policy, using the augmented actions and historical user signals. We use LEA and the state and reward models in conjunction with state-of-the-art RL recommenders and report experimental results on two publicly available datasets.
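The joint optimisation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the supervised component is a softmax cross-entropy over next-item logits, the RL policy is trained with a one-step temporal-difference (TD) loss on Q-values, and the two terms are simply summed. The names `cross_entropy`, `td_loss`, `joint_loss`, `reward_fn` and the discount `gamma` are hypothetical; the logits and Q-values are assumed to be computed elsewhere from a state representation such as the one the state model would produce.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def td_loss(q_values, next_q_values, action, reward, gamma=0.5):
    """Squared one-step TD error for the action that was taken."""
    target = reward + gamma * max(next_q_values)
    return (target - q_values[action]) ** 2


def joint_loss(logits, q_values, next_q_values, logged_action,
               augmented_actions, reward_fn, gamma=0.5):
    """Sum of a supervised term and an RL term (assumed equal weighting).

    `augmented_actions` stands in for positive actions generated by the LE;
    `reward_fn` stands in for the LE acting as a reward model.
    """
    # Supervised term: average cross-entropy over the logged next item
    # and the augmented positive actions.
    positives = [logged_action] + list(augmented_actions)
    supervised = sum(cross_entropy(logits, a) for a in positives) / len(positives)
    # RL term: TD loss on the logged action, with the reward supplied
    # by the (hypothetical) reward model.
    rl = td_loss(q_values, next_q_values, logged_action,
                 reward_fn(logged_action), gamma)
    return supervised + rl
```

In this sketch the augmented actions only enter the supervised term, and the reward model only enters the RL term; how the actual method weights or combines these signals is not specified in the abstract.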