A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require, incurring costs that are higher by orders of magnitude, and this added cost does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.