Augmenting large language models (LLMs) with external tools enhances their performance across a variety of tasks. However, prior work over-relies on task-specific demonstrations of tool use, which limits generalizability, and incurs high computational cost by making many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that generalizes to various tasks requiring tool use without relying on task-specific demonstrations. GEAR achieves better efficiency by delegating tool grounding and execution to small language models (SLMs) and LLMs, respectively, while leveraging semantic and pattern-based evaluation at both the question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating strong generalizability to novel tasks, novel tools, and different SLMs. Despite being more efficient, GEAR achieves higher tool-grounding precision than prior strategies based on LLM prompting, thus improving downstream accuracy at reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform their tool-augmented baseline counterparts because of better tool use.
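The division of labor described above (a small model grounds the query to a tool, a large model executes with it) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the GEAR implementation: the abstract does not specify the scoring functions, so token overlap stands in for the question-level semantic score, a coarse character-class signature stands in for the answer-level pattern score, and the names `Tool`, `ground`, `answer`, `toy_slm`, `toy_llm`, and the weight `alpha` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


def overlap(a: set, b: set) -> float:
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pattern(text: str) -> str:
    """Collapsed signature: 'd' digit run, 'a' letter run, 's' other symbols."""
    out: List[str] = []
    for ch in text:
        if ch.isspace():
            continue
        c = "d" if ch.isdigit() else "a" if ch.isalpha() else "s"
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def ground(query: str, tools: List[Tool], slm: Callable[[str], str],
           alpha: float = 0.5) -> Tool:
    """Pick a tool using a question-level and an answer-level score (assumed forms)."""
    draft = slm(query)  # cheap draft answer from the small model, no tools
    def score(tool: Tool) -> float:
        semantic = overlap(tokens(query), tokens(tool.description))
        shape = overlap(set(pattern(draft)), set(pattern(tool.run(query))))
        return alpha * semantic + (1 - alpha) * shape
    return max(tools, key=score)


def answer(query: str, tools: List[Tool], slm, llm) -> str:
    """Grounding uses the small model; only the final step calls the large model."""
    return llm(query, ground(query, tools, slm))


# Toy stand-ins for the tools and models (hypothetical).
def toy_calculator(q: str) -> str:
    prod = 1
    for n in re.findall(r"\d+", q):
        prod *= int(n)
    return str(prod)


TOOLS = [
    Tool("calculator",
         "calculate arithmetic: add subtract multiply divide numbers",
         toy_calculator),
    Tool("search",
         "search encyclopedia for facts about people places history",
         lambda q: "No article found for: " + q),
]


def toy_slm(q: str) -> str:
    # A rough guess with the right "shape" is enough for the pattern score.
    return "10" if any(ch.isdigit() for ch in q) else "Rome was founded long ago"


def toy_llm(q: str, tool: Tool) -> str:
    return f"[{tool.name}] {tool.run(q)}"


print(answer("multiply 3 and 4", TOOLS, toy_slm, toy_llm))
print(ground("facts about the history of Rome", TOOLS, toy_slm).name)
```

In this sketch the small model's draft answer may be wrong ("10" rather than "12"); it only needs to resemble the output of the right tool, which is why the large model is called once, at the end.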