Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research represents interactions inconsistently, largely overlooks the structural distribution of tool-use trajectories, and relies on mutually incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline, from toolset construction and dataset generation to evaluation. The framework curates a pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns (single-hop vs. multi-hop, single-turn vs. multi-turn) and captures both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. In addition, we convert 7 public benchmarks into a unified Query-Action-Observation-Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.
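To make the QAOA representation and the three evaluation levels concrete, the following is a minimal Python sketch. It is not the UniToolCall schema or its metric definitions: the class names `Action` and `Turn`, the nested-list encoding of serial steps and parallel groups, the tool names in the example, and the exact-match scoring rules are all assumptions made for illustration. In particular, the call-level score here is a simple exact-match stand-in and should not be read as the paper's Strict Precision.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    """One function call (hypothetical schema): tool name plus arguments."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class Turn:
    """One Query-Action-Observation-Answer record (hypothetical schema)."""
    query: str
    # Outer list: serial steps (hops). Inner list: calls issued in parallel.
    steps: list[list[Action]]
    # One observation string per call, mirroring the shape of `steps`.
    observations: list[list[str]]
    answer: str


def same_multiset(a: list[Action], b: list[Action]) -> bool:
    """True if both lists hold the same calls, ignoring order."""
    remaining = list(b)
    for call in a:
        if call not in remaining:
            return False
        remaining.remove(call)
    return not remaining


def turn_matches(pred: list[list[Action]], gold: list[list[Action]]) -> bool:
    """Serial order must match; order inside a parallel group is ignored."""
    return len(pred) == len(gold) and all(
        same_multiset(p, g) for p, g in zip(pred, gold)
    )


def count_call_hits(pred: list[list[Action]], gold: list[list[Action]]) -> int:
    """Number of predicted calls that exactly match an unused gold call."""
    remaining = [call for group in gold for call in group]
    hits = 0
    for call in (c for group in pred for c in group):
        if call in remaining:
            remaining.remove(call)
            hits += 1
    return hits


def evaluate(pred: list[list[list[Action]]], gold: list[Turn]) -> dict[str, float]:
    """Exact-match scores at the function-call, turn, and conversation levels."""
    pred_calls = sum(len(group) for steps in pred for group in steps)
    hits = sum(count_call_hits(p, t.steps) for p, t in zip(pred, gold))
    turn_ok = [turn_matches(p, t.steps) for p, t in zip(pred, gold)]
    return {
        "call": hits / pred_calls if pred_calls else 0.0,
        "turn": sum(turn_ok) / len(gold) if gold else 0.0,
        "conversation": float(len(pred) == len(gold) and all(turn_ok)),
    }


# Illustrative two-turn conversation; tool names and values are made up.
gold = [
    Turn(
        query="Plan a trip to Paris.",
        steps=[
            [Action("search_flights", {"dest": "Paris"})],       # hop 1
            [Action("get_weather", {"city": "Paris"}),            # hop 2,
             Action("get_hotels", {"city": "Paris"})],            # parallel
        ],
        observations=[["flight F1"], ["sunny", "hotel H1"]],
        answer="Flight F1, sunny weather, hotel H1 is available.",
    ),
    Turn(
        query="Book that hotel.",
        steps=[[Action("book_hotel", {"hotel_id": "H1"})]],
        observations=[["booked"]],
        answer="Hotel H1 is booked.",
    ),
]

pred = [
    [
        [Action("search_flights", {"dest": "Paris"})],
        [Action("get_hotels", {"city": "Paris"}),                 # swapped order
         Action("get_weather", {"city": "Paris"})],               # still matches
    ],
    [[Action("book_hotel", {"hotel_id": "H2"})]],                 # wrong argument
]

result = evaluate(pred, gold)
print(result)
```

In this example the first turn is multi-hop with a parallel second step, and the second turn depends on a value (`H1`) produced in the first, which is the kind of cross-turn dependency the abstract attributes to Anchor Linkage. Separating the three levels shows how a single wrong argument leaves most call-level credit intact while failing the turn and the whole conversation.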