How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation

A Conversational Recommender System (CRS) interacts with users through natural language to understand their preferences and provide personalized recommendations in real time. CRSs have demonstrated significant potential, which has made the development of more realistic and reliable user simulators a key research focus. Recently, the capabilities of Large Language Models (LLMs) have attracted considerable attention in various fields, and efforts are underway to construct user simulators based on LLMs. While these works are innovative, they also have limitations that require attention. In this work, we analyze the limitations of using LLMs to construct user simulators for CRS, with the aim of guiding future research. To this end, we conduct an analytical validation of a notable work, iEvaLM. Through multiple experiments on two widely used conversational recommendation datasets, we highlight several issues with current evaluation methods for LLM-based user simulators: (1) Data leakage, which occurs in the conversational history and in the user simulator's replies, inflates evaluation results. (2) The success of CRS recommendations depends more on the availability and quality of the conversational history than on the responses of the user simulator. (3) Controlling the output of the user simulator through a single prompt template proves challenging. To overcome these limitations, we propose SimpleUserSim, which employs a straightforward strategy to guide the topic toward the target items. Our study confirms that CRS models can then make use of the interaction information, which significantly improves the recommendation results.
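To make issue (1) concrete, the following is a minimal sketch of how leakage in simulator replies could be measured: it checks whether a target item's title appears in any reply produced by the user simulator. The function names (`normalize`, `leaks_target`, `leakage_rate`) and the dialogue layout (a dict with `targets` and `simulator_replies` keys) are hypothetical choices made for illustration. They are not taken from iEvaLM or SimpleUserSim, and the sample dialogues are invented.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs so titles match loosely."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def leaks_target(utterance: str, target_items: list[str]) -> bool:
    """Return True if any target item title appears in the utterance."""
    padded = f" {normalize(utterance)} "
    return any(
        f" {normalize(title)} " in padded
        for title in target_items
        if normalize(title)  # skip titles that normalize to an empty string
    )


def leakage_rate(dialogues: list[dict]) -> float:
    """Fraction of dialogues where at least one simulator reply names a target."""
    if not dialogues:
        return 0.0
    leaked = sum(
        any(leaks_target(reply, d["targets"]) for reply in d["simulator_replies"])
        for d in dialogues
    )
    return leaked / len(dialogues)


# Invented toy data (assumed schema): the first simulator reply names its target.
dialogues = [
    {
        "targets": ["The Matrix (1999)"],
        "simulator_replies": ["I'd like something like The Matrix (1999)."],
    },
    {
        "targets": ["Inception (2010)"],
        "simulator_replies": ["I enjoy mind-bending sci-fi thrillers."],
    },
]
print(leakage_rate(dialogues))  # 0.5
```

Exact title matching after normalization is a deliberately conservative choice: it only flags replies that name the target outright, so paraphrased or partial mentions would need a looser matcher.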
