Tool-calling agents are increasingly deployed in real-world, customer-facing workflows. Yet most studies of tool-calling agents focus on idealized settings with general, fixed, and well-specified tasks. In real-world applications, user requests are often (1) ambiguous, (2) changing over time, or (3) infeasible due to policy constraints, and training and evaluation data covering these diverse, complex interaction patterns remain under-represented. To bridge this gap, we present Trajectory2Task, a verifiable data-generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous intent, changing intent, and infeasible intent. The pipeline first conducts multi-turn exploration to produce valid tool-call trajectories, then converts these trajectories into user-facing tasks with controlled intent adaptations. This process yields verifiable tasks that support closed-loop evaluation and training. We benchmark seven state-of-the-art LLMs on the generated complex-user-scenario tasks and observe frequent failures. Finally, using successful trajectories obtained from task rollouts, we fine-tune lightweight LLMs and find consistent improvements across all three conditions, along with better generalization to unseen tool-use domains, indicating stronger general tool-calling ability.