Training reinforcement learning-based recommender systems is often hindered by the lack of dynamic and realistic user interactions. To address this limitation, we introduce Lusifer, a novel environment that leverages Large Language Models (LLMs) to generate simulated user feedback. Lusifer synthesizes user profiles and interaction histories to simulate responses and behaviors toward recommended items, updating each profile after every rating to reflect evolving user characteristics. Using the MovieLens dataset as a proof of concept, we limited our implementation to the last 40 interactions per user, representing approximately 39% and 22% of the training sets, to focus on recent user behavior. For consistency, and to gain insight into how traditional methods perform with limited data, we implemented baseline approaches on the same data subset. Our results demonstrate that Lusifer accurately emulates user behavior and preferences, achieving an RMSE of 1.3 across the various test sets even with reduced training data. This paper presents Lusifer's operational pipeline, including prompt generation and iterative user-profile updates, and compares its performance against the baseline methods. The findings validate Lusifer's ability to produce realistic, dynamic feedback and suggest that it offers a scalable and adjustable framework for user simulation in online reinforcement learning recommender systems, particularly when training data is limited.
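The simulation loop described above (build a prompt from the user profile and recent history, query the LLM for a rating, then fold the rating back into the profile) could be sketched roughly as follows. All function names here are hypothetical, the 40-interaction window mirrors the setup in the abstract, and the LLM call is stubbed with a constant, so this is an illustrative sketch rather than Lusifer's actual implementation.

```python
# Hypothetical sketch of a Lusifer-style user-simulation loop.
# The real system prompts an LLM; llm_stub stands in for that call.

def build_prompt(profile, history, item):
    """Assemble a rating prompt from the profile and the last 40 interactions."""
    recent = ", ".join(f"{title}:{rating}" for title, rating in history[-40:])
    return (f"User profile: {profile}\n"
            f"Recent ratings: {recent}\n"
            f"Predict a 1-5 rating for: {item}")

def llm_stub(prompt):
    """Stand-in for the LLM; always returns a mid-scale rating."""
    return 3.0

def update_profile(profile, item, rating):
    """Update the evolving profile after each rating, as the abstract describes."""
    if rating >= 4:
        profile.setdefault("liked", []).append(item)
    elif rating <= 2:
        profile.setdefault("disliked", []).append(item)
    return profile

def simulate(profile, history, items, llm=llm_stub):
    """Rate each recommended item in turn, updating history and profile."""
    ratings = []
    for item in items:
        rating = llm(build_prompt(profile, history, item))
        history.append((item, rating))
        profile = update_profile(profile, item, rating)
        ratings.append(rating)
    return ratings, profile
```

In an online RL setting, `simulate` would be called one item at a time by the agent, with each returned rating serving as the environment's reward signal.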