Benchmarking Failures in Tool-Augmented Language Models

The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to a wide range of scenarios. However, tool-augmented language models (TaLMs) often assume 'perfect' information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and unavailable tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find that all current models except for Claude struggle to recognize missing tools or information. Further, to study possible mitigation of the failures, we enable real-time human interaction, named the Ask-and-Help (AAH) method, to provide missing information or replace non-functional tools. While AAH can help models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.

