Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs in utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, significant room for improvement remains. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 points and approaches the effectiveness of GPT-3.5. To answer the third question, we use error analysis to highlight the key challenges for future research in this field.