Algorithmic reasoning refers to the ability to understand the complex patterns behind a problem and decompose them into a sequence of reasoning steps that lead to the solution. This makes algorithmic reasoning challenging for large language models (LLMs), even though they have demonstrated promising performance on other reasoning tasks. In this context, some recent studies, inspired by the strict and precise syntax of programming languages, use a programming language (e.g., Python) to express the logic needed to solve a given instance/question (e.g., Program-of-Thought, PoT). However, it is non-trivial to write executable code that expresses the correct logic on the fly within a single inference call. Moreover, code generated for one instance cannot be reused for others, even if they come from the same task and might require identical logic to solve. This paper presents Think-and-Execute, a novel framework that decomposes the reasoning process of language models (LMs) into two steps. (1) In Think, we discover the task-level logic shared across all instances of a given task and express that logic as pseudocode; (2) In Execute, we tailor the generated pseudocode to each instance and simulate the execution of the code. Extensive experiments on seven algorithmic reasoning tasks demonstrate the effectiveness of Think-and-Execute. Our approach improves LMs' reasoning more than several strong baselines that perform instance-specific reasoning (e.g., Chain-of-Thought (CoT) and PoT), suggesting that discovering task-level logic is helpful. We also show that pseudocode guides the reasoning of LMs better than natural language does, even though LMs are trained to follow natural language instructions.
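The two-step control flow described above can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the implementation used in the paper: the `LLM` callable type, the function names `think`, `execute`, and `think_and_execute`, and the prompt wording are all hypothetical. The point it shows is that the Think step runs once per task, while the Execute step runs once per instance and reuses the same pseudocode.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def think(llm: LLM, task_description: str, examples: List[str]) -> str:
    """Think: ask the model for task-level pseudocode shared by all instances."""
    prompt = (
        "Task: " + task_description + "\n"
        "Example questions:\n" + "\n".join(examples) + "\n"
        "Write pseudocode that captures the logic needed to solve any instance of this task."
    )
    return llm(prompt)


def execute(llm: LLM, pseudocode: str, instance: str) -> str:
    """Execute: ask the model to simulate the pseudocode on one concrete instance."""
    prompt = (
        "Pseudocode:\n" + pseudocode + "\n"
        "Input: " + instance + "\n"
        "Simulate the execution step by step and report the final answer."
    )
    return llm(prompt)


def think_and_execute(
    llm: LLM, task_description: str, examples: List[str], instances: List[str]
) -> List[str]:
    """Generate the pseudocode once, then reuse it for every instance of the task."""
    pseudocode = think(llm, task_description, examples)  # one call per task
    return [execute(llm, pseudocode, x) for x in instances]  # one call per instance
```

Under these assumptions, a task with N instances costs N + 1 model calls, and every instance is reasoned about with the same task-level logic. In instance-specific approaches such as PoT, by contrast, the program is regenerated for each question.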