The integration of tools has extended the capabilities of language models (LMs) beyond plain text generation to a wide range of scenarios. However, tool-augmented language models (TaLMs) often assume 'perfect' information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failure modes: under-specified user queries and non-available tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, covering both single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find that all current models except Claude struggle to recognize missing tools or information. To study possible mitigations, we further enable real-time human interaction, termed the Ask-and-Help (AAH) method, to provide missing information or replace non-functional tools. While AAH helps models solve tasks more accurately when queries are under-specified, it brings minimal benefit when complex tools are broken.