Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities. Recently, many studies have focused on the tool-utilization ability of LLMs; these studies primarily investigate how LLMs collaborate effectively with specific, given tools. However, in scenarios where LLMs serve as intelligent agents, as in applications like AutoGPT and MetaGPT, LLMs are expected to engage in intricate decision-making: deciding whether to employ a tool at all and selecting the most suitable tool(s) from a collection of available tools to fulfill user requests. Therefore, in this paper, we introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool-usage awareness and can correctly choose tools. Specifically, we create a dataset called ToolE within the benchmark. This dataset contains various types of user queries, in the form of prompts, that trigger LLMs to use tools, covering both single-tool and multi-tool scenarios. We then define tasks for both tool-usage awareness and tool selection, with four tool-selection subtasks from different perspectives: tool selection among similar choices, tool selection in specific scenarios, tool selection under possible reliability issues, and multi-tool selection. We conduct experiments on eight popular LLMs and find that most of them still struggle to select tools effectively, highlighting the gap between current LLMs and genuine intelligent agents. Our error analysis, however, shows that there is still significant room for improvement. Finally, we conclude with insights for tool developers: we strongly recommend that they choose an appropriate rewrite model for generating new tool descriptions based on the downstream LLM the tool will be applied to. Our code is available at https://github.com/HowieHwong/MetaTool.
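The tool-selection evaluation described above can be pictured as a single prompt-and-check step: the model is shown a user query plus a list of candidate tools with descriptions, and its answer is compared against a gold label. The sketch below is illustrative only; the tool list, query, scoring rule, and the `call_llm` stub are assumptions for demonstration, not MetaTool's actual data or interface.

```python
# Illustrative sketch of a single-tool selection check (not MetaTool's real
# data or API). A keyword heuristic stands in for a real LLM call.

TOOLS = {
    "weather": "Look up the current weather for a city.",
    "calculator": "Evaluate arithmetic expressions.",
    "translator": "Translate text between languages.",
}

def build_prompt(query: str) -> str:
    """Render the query plus the candidate tools into a single prompt."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        f"User query: {query}\n"
        f"Available tools:\n{tool_lines}\n"
        "Reply with the single best tool name, or 'none' if no tool is needed."
    )

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM: a trivial keyword match on the query part."""
    query_part = prompt.lower().split("available tools")[0]
    for name in TOOLS:
        if name in query_part:
            return name
    return "none"

def tool_selection_correct(query: str, gold_tool: str) -> bool:
    """Score one example: does the model's choice match the gold label?"""
    answer = call_llm(build_prompt(query)).strip().lower()
    return answer == gold_tool

print(tool_selection_correct("What's the weather in Paris?", "weather"))  # True
```

In the benchmark's multi-tool subtask the model would instead return a set of tool names, and the tool-usage-awareness task asks the yes/no question of whether any tool is needed at all; both are straightforward variations on the same prompt-and-check loop.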