Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain-of-thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g., search or code execution). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library. At test time, ART pauses generation whenever an external tool is called, and integrates the tool's output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches the performance of hand-crafted CoT prompts on a majority of these tasks. ART is also extensible: humans can improve performance by correcting errors in task-specific programs or by incorporating new tools, which we demonstrate by drastically improving performance on select tasks with minimal human intervention.
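The two mechanisms described above, selecting demonstrations from a task library and pausing generation at each tool call, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the step format (`[tool] argument`, `[output] ...`, `Answer: ...`), the word-overlap selection heuristic, the `calc` tool, and the `scripted_model` function that stands in for a frozen LLM. It is not the program grammar, selection method, or tool set used by ART.

```python
import operator
import re
from typing import Callable, Dict, List

# Hypothetical task library: task description -> demonstration program.
TASK_LIBRARY: Dict[str, str] = {
    "arithmetic: multiply or add two numbers":
        "Input: What is 3 + 4?\n[calc] 3 + 4\n[output] 7\nAnswer: 7",
    "search: look up a fact about a country":
        "Input: What is the capital of France?\n"
        "[search] capital of France\n[output] Paris\nAnswer: Paris",
}

def select_demonstrations(task: str, library: Dict[str, str], k: int = 1) -> List[str]:
    """Rank library entries by word overlap with the task (an assumed heuristic)."""
    task_words = set(task.lower().split())
    ranked = sorted(
        library.items(),
        key=lambda item: len(task_words & set(item[0].lower().split())),
        reverse=True,
    )
    return [demo for _, demo in ranked[:k]]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def calc(expression: str) -> str:
    """Toy tool: evaluates '<int> <op> <int>' without using eval."""
    left, op, right = expression.split()
    return str(OPS[op](int(left), int(right)))

TOOL_CALL = re.compile(r"\[(\w+)\]\s*(.+)")

def run_with_tools(generate_step: Callable[[str], str],
                   tools: Dict[str, Callable[[str], str]],
                   prompt: str, max_steps: int = 8) -> str:
    """Generate one step at a time; on a tool call, run the tool and
    append its output to the transcript before generation resumes."""
    transcript = prompt
    for _ in range(max_steps):
        step = generate_step(transcript)  # generation stops after each step
        transcript += step + "\n"
        if step.startswith("Answer:"):
            break
        match = TOOL_CALL.match(step)
        if match and match.group(1) in tools:
            output = tools[match.group(1)](match.group(2))
            transcript += f"[output] {output}\n"
    return transcript

def scripted_model(transcript: str) -> str:
    """Stand-in for a frozen LLM: calls the tool once, then answers with its output."""
    current_task = transcript.rsplit("Input:", 1)[-1]
    outputs = re.findall(r"^\[output\] (.+)$", current_task, flags=re.MULTILINE)
    return f"Answer: {outputs[-1]}" if outputs else "[calc] 12 * 7"

demos = select_demonstrations("multiply two numbers", TASK_LIBRARY)
prompt = "\n\n".join(demos) + "\n\nInput: What is 12 * 7?\n"
transcript = run_with_tools(scripted_model, {"calc": calc}, prompt)
final_answer = transcript.strip().splitlines()[-1]
print(final_answer)
```

In this sketch a new tool is added by registering one more entry in the `tools` dictionary, and a faulty demonstration is corrected by editing its entry in `TASK_LIBRARY`, which mirrors the kind of human intervention the abstract describes.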