MetaTool Benchmark: Deciding Whether to Use Tools and Which to Use

Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool utilization ability of LLMs, primarily investigating how LLMs collaborate effectively with specific, given tools. However, in scenarios where LLMs serve as intelligent agents, as seen in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making processes that involve deciding whether to employ a tool and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries in the form of prompts that trigger LLMs to use tools, covering both single-tool and multi-tool scenarios. We then define two tasks: tool usage awareness and tool selection. Within tool selection, we define four subtasks from different perspectives: tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. We conduct experiments involving nine popular LLMs and find that the majority of them still struggle to select tools effectively, highlighting the existing gaps between LLMs and genuine intelligent agents. Our error analysis nonetheless shows that there is still significant room for improvement. Finally, we conclude with insights for tool developers: following ChatGPT's practice of providing detailed tool descriptions can enhance the tool selection performance of LLMs.
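The two tasks named in the abstract can be pictured with a minimal Python sketch. Everything below is an assumption made for illustration: the `Query` record, its field names, the example queries, the tool names, and the two accuracy functions are hypothetical and are not taken from the ToolE dataset or the paper's evaluation protocol. The sketch only shows that tool usage awareness can be scored as a yes/no decision per query, while tool selection can be scored by comparing the chosen set of tools against a reference set, which also covers the multi-tool case.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """One hypothetical benchmark item (field names are illustrative only)."""
    text: str
    needs_tool: bool                      # reference label for tool usage awareness
    gold_tools: frozenset = frozenset()   # reference label for tool selection


def awareness_accuracy(queries, predictions):
    """Fraction of queries where the yes/no 'use a tool?' decision is correct."""
    correct = sum(q.needs_tool == p for q, p in zip(queries, predictions))
    return correct / len(queries)


def selection_accuracy(queries, predictions):
    """Fraction of tool-requiring queries whose chosen tool set matches exactly."""
    pairs = [(q, p) for q, p in zip(queries, predictions) if q.needs_tool]
    correct = sum(q.gold_tools == frozenset(p) for q, p in pairs)
    return correct / len(pairs)


# Made-up queries and made-up model outputs, purely for demonstration.
queries = [
    Query("What is the weather in Paris tomorrow?", True, frozenset({"weather"})),
    Query("Book a flight and a hotel in Rome.", True, frozenset({"flights", "hotels"})),
    Query("Write a haiku about autumn.", False),
]
awareness_preds = [True, True, True]              # model always says "use a tool"
selection_preds = [["weather"], ["flights"], []]  # model misses one tool in query 2

print(awareness_accuracy(queries, awareness_preds))
print(selection_accuracy(queries, selection_preds))
```

Exact set matching is only one possible scoring choice; a benchmark could just as well give partial credit when some of the required tools are selected.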

