Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

The recent trend of using Large Language Models (LLMs) as intelligent agents in real-world applications underscores the need for comprehensive evaluation of their capabilities, particularly in complex scenarios that involve planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, and so offer only a limited view of tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs' tool-utilization ability in real-world scenarios. UltraTool covers the entire process of using tools, from planning and creating them to applying them in complex tasks. It emphasizes real-world complexity, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of natural-language planning, which happens before tool usage and simplifies task solving by mapping out the intermediate steps. Thus, unlike previous work, it removes the restriction of a pre-defined toolset during planning. Through extensive experiments on various LLMs, we offer novel insights into evaluating the tool-utilization capabilities of LLMs, contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at https://github.com/JoeYing1019/UltraTool.

