API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by using external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs at using tools? (2) How can we improve LLMs' ability to use tools? (3) What obstacles must be overcome before tools can be leveraged? To address these questions, we introduce API-Bank, a groundbreaking benchmark designed specifically for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools, and we annotate 314 tool-use dialogues containing 753 API calls to assess existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results show that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant room for improvement. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 points and approaches the effectiveness of GPT-3.5. For the third question, we use error analysis to highlight the key challenges for future research in this field.
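To make the idea of assessing API calls against annotated dialogues concrete, here is a minimal sketch of one way such scoring could work. It assumes an exact-match comparison between predicted and annotated calls, position by position; the abstract does not specify the metric, and the `ApiCall` schema, the `call_accuracy` helper, and the API names `GetWeather` and `AddReminder` are all hypothetical illustrations rather than part of API-Bank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiCall:
    """One tool invocation (hypothetical schema, not API-Bank's own format)."""
    name: str
    args: tuple = ()  # sorted (key, value) pairs, so argument order is irrelevant

    @staticmethod
    def make(name, **kwargs):
        # Sorting the keyword arguments makes two calls with the same
        # arguments compare equal regardless of the order they were given in.
        return ApiCall(name, tuple(sorted(kwargs.items())))


def call_accuracy(predicted, reference):
    """Fraction of annotated calls that the model reproduced exactly.

    Assumption: calls are compared position by position, and a call counts
    as correct only if both the API name and every argument match.
    """
    if not reference:
        return 0.0
    hits = sum(1 for p, r in zip(predicted, reference) if p == r)
    return hits / len(reference)


# Hypothetical annotated dialogue with two API calls.
reference = [
    ApiCall.make("GetWeather", city="Paris"),
    ApiCall.make("AddReminder", text="umbrella"),
]
# Hypothetical model output: the first call is right, the second has a wrong argument.
predicted = [
    ApiCall.make("GetWeather", city="Paris"),
    ApiCall.make("AddReminder", text="coat"),
]

accuracy = call_accuracy(predicted, reference)
print(f"API call accuracy: {accuracy:.2f}")
```

Exact matching is deliberately strict: a call with the right API but one wrong argument scores zero, which keeps planning and retrieval mistakes from being hidden by partial credit.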
