Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirably biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones to provide. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation, that is, descriptions of how each individual tool is used, over demonstrations. We substantiate our claim through three main empirical findings on 6 tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient for eliciting proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is significantly more valuable than demonstrations: zero-shot prompts with documentation significantly outperform few-shot prompts without documentation. Third, we highlight the benefits of tool documentation by tackling image generation and video tracking using just-released, unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: by using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
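The contrast drawn above between the two prompting styles can be made concrete with a minimal sketch. It is not the paper's prompt format: the `Tool` class, the two builder functions, the prompt wording, and the tool names `detect_objects` and `segment` are all hypothetical, chosen only to illustrate the difference between a zero-shot prompt built from documentation and a few-shot prompt built from demonstrations.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """Hypothetical container for a tool and its free-text documentation."""
    name: str
    doc: str


def build_doc_prompt(tools, question):
    """Zero-shot prompt: tool documentation only, no demonstrations."""
    doc_block = "\n\n".join(f"Tool: {t.name}\n{t.doc}" for t in tools)
    return (
        "You can use the following tools.\n\n"
        f"{doc_block}\n\n"
        f"Question: {question}\nPlan:"
    )


def build_demo_prompt(demos, question):
    """Few-shot prompt: worked demonstrations only, no documentation."""
    demo_block = "\n\n".join(f"Question: {q}\nPlan: {p}" for q, p in demos)
    return f"{demo_block}\n\nQuestion: {question}\nPlan:"


# Hypothetical tools; names and descriptions are made up for illustration.
tools = [
    Tool("detect_objects", "detect_objects(image, text) returns boxes for objects matching text."),
    Tool("segment", "segment(image, boxes) returns one mask per input box."),
]
question = "Produce a mask for every dog in the image."

doc_prompt = build_doc_prompt(tools, question)

# A single hand-written demonstration, also hypothetical.
demos = [("Find the cat in the image.", "detect_objects(image, 'cat')")]
demo_prompt = build_demo_prompt(demos, question)

print(doc_prompt)
```

Under this sketch, supporting a new tool only requires appending its documentation to `tools`, whereas the demonstration-based prompt needs a new hand-written example for each tool or tool combination.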
