Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly and exhibit only cooperative behaviors, so they cannot be used to train and test agents against the non-collaborative users encountered in the real world. We propose a novel user simulator architecture that simulates four categories of non-collaborative behavior: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator produces challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and τ-bench reveal significant performance degradation in state-of-the-art tool agents when they encounter non-collaborative users, as well as agent weaknesses under each non-collaborative condition, such as escalated hallucinations and dialogue breakdowns. Our findings point to the need for methods that improve agent robustness to the wide range of user behaviors encountered in deployment. We release the extensible simulation framework to help the community develop and stress-test tool agents under realistic conditions within their own service domains. Our code is available at https://github.com/holi-lab/NCUser.