Teaching language models to use tools is an important milestone towards building general assistants, but it remains an open problem. While there has been significant progress on learning to use specific tools via fine-tuning, language models still struggle to learn how to robustly use new tools from only a few demonstrations. In this work, we introduce a self-verification method that distinguishes between close candidates by self-asking contrastive questions during (1) tool selection and (2) parameter generation. We construct synthetic, high-quality, self-generated data for this purpose using Llama-2 70B, which we intend to release publicly. Extensive experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines, even in scenarios where the distinctions between candidate tools are finely nuanced.
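To make the idea of self-asked contrastive questions concrete, the following is a minimal sketch of how verification at the tool-selection stage might look. It is an illustration under stated assumptions, not the implementation used in this work: the `LLM` callable, the prompt wording, the function names, and the fallback rule are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical interface: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def contrastive_question_prompt(a: str, b: str, descriptions: Dict[str, str]) -> str:
    """Build a prompt asking the model for a question that separates two close tools."""
    return (
        f"Tool '{a}': {descriptions[a]}\n"
        f"Tool '{b}': {descriptions[b]}\n"
        f"What single question about the request would tell '{a}' apart from '{b}'?"
    )


def verify_selection(
    request: str,
    top_two: Tuple[str, str],
    descriptions: Dict[str, str],
    llm: LLM,
) -> str:
    """Pick between two close candidate tools by self-asking a contrastive question.

    Assumed flow (illustrative only):
      1. ask the model for a question that distinguishes the two candidates,
      2. ask the model to answer that question for the given request,
      3. ask the model to commit to one tool given the question and answer.
    """
    a, b = top_two
    question = llm(contrastive_question_prompt(a, b, descriptions))
    answer = llm(f"Request: {request}\nQuestion: {question}\nAnswer briefly.")
    choice = llm(
        f"Request: {request}\n"
        f"Verification question: {question}\n"
        f"Answer: {answer}\n"
        f"Reply with exactly one tool name: {a} or {b}."
    ).strip()
    # Assumption: if the reply is not one of the two names, keep the original top candidate.
    return choice if choice in (a, b) else a
```

The same three-step pattern could presumably be reused at the parameter-generation stage by contrasting two candidate parameter values instead of two tool names; that reuse is likewise an assumption of this sketch.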