Augmenting large language models (LLMs) with external tools enhances their performance across a variety of tasks. However, prior work over-relies on task-specific demonstrations of tool use, which limits generalizability, and incurs high computational cost by making many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that generalizes to various tasks requiring tool use without relying on task-specific demonstrations. GEAR achieves better efficiency by delegating tool grounding to small language models (SLMs) and execution to an LLM, while leveraging semantic and pattern-based evaluation at both the question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating its strong generalizability to novel tasks, tools, and different SLMs. Despite being more efficient, GEAR achieves higher precision in tool grounding than prior strategies that use LLM prompting, thus improving downstream accuracy at a reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform their counterpart tool-augmented baselines because of better tool use.
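To make the idea of query-tool grounding concrete, the following toy sketch scores each candidate tool by combining a question-level semantic score with an answer-level pattern score. It is an illustration under stated assumptions, not the GEAR implementation: the names `ground`, `pattern`, and `alpha`, the bag-of-words cosine similarity, the character-class pattern encoding, the weighted sum, and the stub tools are all hypothetical, and the draft answer stands in for what a small language model might produce.

```python
import math
import re
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bag_of_words(text: str) -> Counter:
    """Lowercased alphabetic tokens; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def pattern(text: str) -> Counter:
    """Coarse 'shape' of an answer: counts of digits, letters, and other symbols."""
    shape = Counter()
    for ch in text:
        if ch.isdigit():
            shape["digit"] += 1
        elif ch.isalpha():
            shape["letter"] += 1
        elif not ch.isspace():
            shape["other"] += 1
    return shape


def ground(query, draft_answer, tools, alpha=0.5):
    """Pick a tool for `query`.

    tools maps a tool name to (description, callable).
    alpha (assumed) weights the question-level score against the answer-level score.
    """
    query_vec = bag_of_words(query)
    draft_shape = pattern(draft_answer)
    scores = {}
    for name, (description, run) in tools.items():
        semantic = cosine(query_vec, bag_of_words(description))   # question level
        shape_match = cosine(draft_shape, pattern(run(query)))    # answer level
        scores[name] = alpha * semantic + (1 - alpha) * shape_match
    return max(scores, key=scores.get), scores


# Hypothetical stub tools; real tools would actually process the query.
tools = {
    "calculator": ("calculate an arithmetic expression", lambda q: "56"),
    "wiki": ("look up facts about people and places",
             lambda q: "Paris is the capital of France."),
}

# The draft answer may be wrong ("54"), but its numeric shape still points to the calculator.
best, scores = ground("calculate 7 times 8", "54", tools)
print(best, scores)
```

The point of the sketch is that the draft answer only needs the right shape, not the right value, which is why a small model could plausibly handle grounding while the larger model is reserved for producing the final answer.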