Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. This deficiency stems from the fact that LLMs lack an internal $\textit{world model}$ to predict the world $\textit{state}$ (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome these limitations, we propose a new LLM reasoning framework, $\underline{R}\textit{easoning vi}\underline{a} \underline{P}\textit{lanning}$ $\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and efficiently obtains a high-reward reasoning path with a proper balance between exploration and exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with a 33% relative improvement in a plan generation setting.
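To make the search loop described above concrete, the following is a minimal sketch of Monte Carlo Tree Search over a toy arithmetic task, not the RAP implementation itself. Every name here (`propose_actions`, `world_model`, `reward`, `mcts`, `replay`) is hypothetical. The two LLM roles are replaced by plain Python functions, the reward is an assumed distance-based score, and the greedy rollout and "keep the best trajectory seen" rule are illustrative choices rather than details confirmed by the abstract.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

TARGET = 10          # toy goal state (assumption for illustration)
MAX_DEPTH = 4        # maximum number of reasoning steps
ACTIONS = ("+1", "*2")


def propose_actions(state: int):
    """Stand-in for the LLM as agent: list candidate next steps."""
    return ACTIONS


def world_model(state: int, action: str) -> int:
    """Stand-in for the LLM as world model: predict the next state."""
    return state + 1 if action == "+1" else state * 2


def reward(state: int) -> float:
    """Assumed task-specific reward in (0, 1]; 1.0 only at the target."""
    return 1.0 / (1.0 + abs(state - TARGET))


@dataclass
class Node:
    state: int
    depth: int
    parent: Optional["Node"] = None
    action: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total: float = 0.0


def uct(child: Node, parent_visits: int, c: float) -> float:
    """Upper confidence bound: mean value plus an exploration bonus."""
    if child.visits == 0:
        return math.inf
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.total / child.visits + bonus


def path_to(node: Node) -> List[str]:
    """Actions from the root down to this node."""
    actions = []
    while node.parent is not None:
        actions.append(node.action)
        node = node.parent
    return actions[::-1]


def replay(start: int, path: List[str]) -> int:
    """Apply a sequence of actions through the world model."""
    state = start
    for action in path:
        state = world_model(state, action)
    return state


def mcts(start: int, iterations: int = 50, c: float = 1.0):
    root = Node(state=start, depth=0)
    best_reward, best_path = -1.0, []
    for _ in range(iterations):
        # 1. Selection: descend by UCT until reaching a leaf of the tree.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node.visits, c))
        # 2. Expansion: add one child per proposed action, step into the first.
        if node.depth < MAX_DEPTH and node.state != TARGET:
            for a in propose_actions(node.state):
                child = Node(world_model(node.state, a), node.depth + 1, node, a)
                node.children.append(child)
            node = node.children[0]
        # 3. Simulation: greedy rollout guided by the immediate reward.
        path, state, depth = path_to(node), node.state, node.depth
        while depth < MAX_DEPTH and state != TARGET:
            a = max(propose_actions(state),
                    key=lambda act: reward(world_model(state, act)))
            state, depth = world_model(state, a), depth + 1
            path.append(a)
        r = reward(state)
        if r > best_reward:
            best_reward, best_path = r, path
        # 4. Backpropagation: update statistics along the selected path.
        while node is not None:
            node.visits += 1
            node.total += r
            node = node.parent
    return best_path, best_reward


if __name__ == "__main__":
    print(mcts(start=1))
```

On this toy task a purely greedy rollout from the start state ends at 9, while the tree search reaches the target 10 within the four-step budget, which illustrates why the exploration term matters. In a real setting, `propose_actions`, `world_model`, and `reward` would each be backed by LLM calls and task-specific scoring.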