Recent research has shown that Large Language Models (LLMs) can use external tools to improve their contextual processing abilities, moving beyond the pure language modeling paradigm and paving the way toward Artificial General Intelligence. Despite this, a systematic evaluation of how effectively LLMs use tools to respond to human instructions has been lacking. This paper presents API-Bank, the first benchmark tailored for Tool-Augmented LLMs. API-Bank includes 53 commonly used API tools, a complete Tool-Augmented LLM workflow, and 264 annotated dialogues that together contain 568 API calls. These resources are designed to thoroughly evaluate LLMs' ability to plan step-by-step API calls, retrieve relevant APIs, and correctly execute API calls to meet human needs. The experimental results show that the ability to use tools emerges in GPT-3.5 relative to GPT-3, while GPT-4 has stronger planning performance. Nevertheless, considerable scope for improvement remains when compared with human performance. Additionally, detailed error analysis and case studies demonstrate the feasibility of Tool-Augmented LLMs for daily use, as well as the primary challenges that future research needs to address.
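As an illustration of what checking "correctly executed API calls" against annotated dialogues could look like, the following is a minimal sketch and not the benchmark's actual scoring code. The `ApiCall` schema, the position-by-position exact-match rule, and the API names `QueryWeather` and `AddReminder` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    """One API invocation (hypothetical schema, not taken from API-Bank)."""
    name: str
    arguments: tuple  # sorted (key, value) pairs, so keyword order does not matter


def make_call(name, **kwargs):
    """Build an ApiCall with its arguments in a canonical (sorted) order."""
    return ApiCall(name, tuple(sorted(kwargs.items())))


def call_accuracy(predicted, annotated):
    """Fraction of annotated calls that the prediction matches exactly,
    compared position by position. Returns 0.0 if nothing is annotated."""
    if not annotated:
        return 0.0
    correct = sum(1 for p, a in zip(predicted, annotated) if p == a)
    return correct / len(annotated)


# Hypothetical annotated calls for one dialogue, and a model's predictions.
annotated = [
    make_call("QueryWeather", city="Paris"),
    make_call("AddReminder", text="umbrella", time="08:00"),
]
predicted = [
    make_call("QueryWeather", city="Paris"),
    make_call("AddReminder", text="umbrella", time="09:00"),  # wrong argument value
]

score = call_accuracy(predicted, annotated)
print(score)
```

Exact matching is the strictest possible choice: a call with the right API name but one wrong argument counts as fully incorrect. A real evaluation might instead compare the results returned by executing the calls, or score planning and retrieval separately.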