Reinforcement learning (RL) recommender systems often rely on static datasets that fail to capture the fluid, ever-changing nature of user preferences in real-world scenarios. Meanwhile, generative AI techniques have emerged as powerful tools for creating synthetic data, including user profiles and behaviors. Recognizing this potential, we introduce Lusifer, an LLM-based simulation environment designed to generate dynamic, realistic user feedback for training RL-based recommenders. In Lusifer, user profiles are incrementally updated at each interaction step, with Large Language Models (LLMs) providing transparent explanations of how and why preferences evolve. We focus on the MovieLens dataset, extracting only the last 40 interactions for each user to emphasize recent behavior. By processing textual metadata (such as movie overviews and tags), Lusifer creates more context-aware user states and simulates feedback on new items, including those with limited or no prior ratings. This approach reduces reliance on extensive historical data, facilitates cold-start handling, and supports adaptation to out-of-distribution cases. Our experiments compare Lusifer with traditional collaborative filtering models, revealing that while Lusifer is comparable in predictive accuracy, it excels at capturing dynamic user responses and yielding explainable results at every step. These qualities highlight its potential as a scalable, ethically sound alternative to live user experiments, supporting iterative, user-centric evaluation of RL-based recommender strategies. Looking ahead, we envision Lusifer serving as a foundational tool for exploring generative AI-driven user simulations, enabling more adaptive and personalized recommendation pipelines under real-world constraints.
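The incremental profile update described above (a rolling window of the last 40 interactions, with an LLM rewriting the user state after each step) can be sketched as follows. This is a minimal illustration, not Lusifer's actual implementation: `SimulatedUser`, `WINDOW`, and the `summarize` stub standing in for the LLM call are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field

WINDOW = 40  # keep only the most recent interactions, mirroring the 40-interaction window


def summarize(profile: str, history: list) -> str:
    # Placeholder for the LLM call: in Lusifer, a prompt containing the old
    # profile plus recent interactions would yield a revised profile and an
    # explanation of why preferences changed. Here we only record progress.
    return f"{profile} | updated after {len(history)} interactions"


@dataclass
class SimulatedUser:
    profile: str                               # natural-language preference summary
    history: list = field(default_factory=list)

    def observe(self, item_metadata: str, rating: float) -> None:
        """Record one interaction, trim to the rolling window, refresh the profile."""
        self.history.append((item_metadata, rating))
        self.history = self.history[-WINDOW:]
        self.profile = summarize(self.profile, self.history)
```

Because the window is bounded, the simulated state emphasizes recent behavior regardless of how long a user's full history is, which is what lets the environment handle users with sparse or shifting histories.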