User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and proactively engages in negotiation by challenging or bargaining. However, current methods exhibit two issues. They rely on static, context-unaware profiles, which necessitate extensive manual redesign for new scenarios and thus limit generalizability. Moreover, they neglect human strategic thinking, leaving them vulnerable to agent manipulation. To address these issues, we propose UserLM-R1, a novel user language model with reasoning capability. Specifically, we first construct comprehensive user profiles that combine static roles with dynamic, scenario-specific goals, enabling adaptation to diverse scenarios. We then propose a goal-driven decision-making policy that generates high-quality rationales before producing responses, and further refine the reasoning and strengthen strategic capabilities through supervised fine-tuning and multi-reward reinforcement learning. Extensive experimental results demonstrate that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.