A Comprehensive Evaluation of Tool-Assisted Generation Strategies

A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require, costing orders of magnitude more, and this added cost does not translate into significant performance improvements. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.
