User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

Conversational Recommender Systems (CRSs) leverage natural language interactions for personalized recommendation, yet information-scarce dialogue histories and single-turn recommendation paradigms can severely hinder accurate modeling of complex user preferences. To alleviate this issue, recent studies have introduced LLM-based user simulators, which generate natural language feedback and carry out simulated multi-turn interactions to assist recommendation. However, because simulators cannot access true user preference labels during inference, their feedback may deviate from actual user interests, so errors accumulate over multiple turns and severely degrade the recommender's generalization. Inspired by the multi-step reasoning capabilities of LLMs and the effectiveness of reinforcement learning in policy optimization, we propose SMTPO, a conversational recommendation framework based on user simulator-guided multi-turn preference optimization. To align simulator-generated feedback with true user preferences in the absence of explicit labels, we improve feedback quality via multi-task supervised fine-tuning (SFT), enabling the simulator to better reflect users' complex and diverse needs. To keep biased feedback from destabilizing multi-turn optimization, we first let the reasoning LLM-based recommender learn preference reasoning and recommendation patterns through SFT, and then apply reinforcement learning with a fine-grained reward design to progressively align it with true user preferences, improving recommendation performance. Extensive experiments on public datasets demonstrate the effectiveness and transferability of our method.
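As a rough illustration of the multi-turn setup described above, the following sketch rolls out a simulated dialogue in which a recommender proposes items, a user simulator replies in natural language without seeing the ground-truth item, and a per-turn reward is computed from the ground truth for training purposes only. Everything here is an assumption made for illustration: the names `rollout`, `turn_reward`, `toy_recommend`, and `toy_feedback`, the reciprocal-rank reward with a per-turn decay, and the toy data are hypothetical, and the abstract does not specify SMTPO's actual reward design or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Turn:
    recommended: List[str]
    feedback: str
    reward: float


def turn_reward(recommended: Sequence[str], target: str, turn: int,
                decay: float = 0.5) -> float:
    """Hypothetical fine-grained reward: reciprocal rank of the target item,
    discounted so that hits in earlier turns are worth more."""
    if target not in recommended:
        return 0.0
    rank = list(recommended).index(target) + 1
    return (decay ** turn) / rank


def rollout(recommend: Callable[[List[str]], List[str]],
            simulate_feedback: Callable[[List[str], List[str]], str],
            history: List[str], target: str, max_turns: int = 3) -> List[Turn]:
    """Run a simulated multi-turn interaction and record per-turn rewards.

    The simulator only sees the dialogue context and the recommended items,
    never `target`; the target is used solely to score each turn.
    """
    context = list(history)
    turns: List[Turn] = []
    for t in range(max_turns):
        items = recommend(context)
        reward = turn_reward(items, target, t)
        if reward > 0:
            # Target was recommended: stop the dialogue early.
            turns.append(Turn(items, "accept", reward))
            break
        feedback = simulate_feedback(context, items)
        turns.append(Turn(items, feedback, reward))
        context += [f"recommender: {', '.join(items)}", f"user: {feedback}"]
    return turns


# Toy stand-ins for the recommender and the user simulator (made-up data).
def toy_recommend(context: List[str]) -> List[str]:
    if any("sci-fi" in line for line in context):
        return ["Arrival", "Interstellar"]
    return ["Titanic", "Notting Hill"]


def toy_feedback(context: List[str], items: List[str]) -> str:
    return "I'd prefer something sci-fi."


turns = rollout(toy_recommend, toy_feedback,
                ["user: Any movie suggestions?"], target="Interstellar")
for i, turn in enumerate(turns):
    print(i, turn.recommended, turn.feedback, turn.reward)
```

In a reinforcement learning stage, per-turn rewards like these would typically be aggregated into a training signal for the recommender policy; how SMTPO does this is not stated in the abstract, so the aggregation step is left out of the sketch.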
