The recent trend of using Large Language Models (LLMs) as intelligent agents in real-world applications underscores the need for comprehensive evaluation of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, and therefore offer a limited perspective on tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs' tool utilization ability in real-world scenarios. UltraTool covers the entire process of using tools, from planning and creating them to applying them in complex tasks. It emphasizes real-world complexity, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of natural-language planning, which happens before tool usage and simplifies task solving by mapping out the intermediate steps. Thus, unlike previous work, it removes the restriction of a pre-defined toolset during planning. Through extensive experiments on various LLMs, we offer novel insights into evaluating the tool utilization capabilities of LLMs, thereby contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at https://github.com/JoeYing1019/UltraTool.