Reasoning with Language Model is Planning with World Model

Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the fact that LLMs lack an internal $\textit{world model}$ to predict the world $\textit{state}$ (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to that of human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome these limitations, we propose a new LLM reasoning framework, $\underline{R}\textit{easoning vi}\underline{a} \underline{P}\textit{lanning}$ $\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and efficiently obtains a high-reward reasoning path with a proper balance between exploration and exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.
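To make the planning loop described above more concrete, the following is a minimal sketch of Monte Carlo Tree Search on a toy arithmetic task. It is not the authors' implementation: `propose_actions`, `predict_next_state`, and `reward` are hypothetical stand-ins for the LLM acting as agent, the LLM acting as world model, and a task-specific reward, and the toy task (reach a target integer using "+1" and "*2") is an assumption chosen only so the code runs without a model.

```python
import math
import random


class Node:
    """One node of the reasoning tree: a state plus search statistics."""

    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action      # action that led from parent to this node
        self.children = []
        self.visits = 0
        self.total_reward = 0.0


def propose_actions(state):
    # Hypothetical stand-in for the LLM as agent: candidate next steps.
    return ["+1", "*2"]


def predict_next_state(state, action):
    # Hypothetical stand-in for the LLM as world model: next state.
    return state + 1 if action == "+1" else state * 2


def reward(state, target):
    # Hypothetical task-specific reward: higher when closer to the target.
    return 1.0 / (1.0 + abs(target - state))


def uct_score(child, c):
    # Upper-confidence bound: mean reward (exploitation) plus a bonus
    # for rarely visited children (exploration).
    if child.visits == 0:
        return math.inf
    mean = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return mean + bonus


def mcts(root_state, target, max_depth=4, iterations=200, c=1.0, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node, depth = root, 0
        # Selection: descend through expanded nodes by UCT score.
        while node.children and depth < max_depth:
            node = max(node.children, key=lambda ch: uct_score(ch, c))
            depth += 1
        # Expansion: add one child per proposed action, then pick one.
        if depth < max_depth:
            node.children = [
                Node(predict_next_state(node.state, a), node, a)
                for a in propose_actions(node.state)
            ]
            node = rng.choice(node.children)
            depth += 1
        # Simulation: random rollout with the world model to max depth.
        state = node.state
        for _ in range(max_depth - depth):
            action = rng.choice(propose_actions(state))
            state = predict_next_state(state, action)
        r = reward(state, target)
        # Backpropagation: update statistics along the selected path.
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    # Read out a path by following the child with the highest mean reward.
    path, node = [], root
    while node.children:
        visited = [ch for ch in node.children if ch.visits > 0]
        if not visited:
            break
        node = max(visited, key=lambda ch: ch.total_reward / ch.visits)
        path.append(node.action)
    return path, node.state
```

Calling `mcts(1, 10)` returns a list of actions and the state they lead to. The constant `c` controls the trade-off the abstract refers to: larger values favor exploring rarely visited branches, smaller values favor exploiting branches that already look rewarding.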

