While language models (LMs) show significant capability on zero-shot reasoning tasks across a wide range of domains, they perform unsatisfactorily on problems that require multi-step reasoning. Previous approaches to mitigating this break a larger, multi-step task into sub-tasks, ask the language model to generate proposals ("thoughts") for each sub-task, and use exhaustive planning approaches such as depth-first search (DFS) to compose a solution. In this work, we build on this idea and make two contributions: first, we formalize a planning-based approach to multi-step problem solving with LMs via Partially Observable Markov Decision Processes (POMDPs), with the LM's own reflections about the value of a state used as a search heuristic; second, using the online POMDP solver POMCP, we demonstrate a success rate of 89.4% on the Game of 24 task, higher than that of existing approaches, while also offering better anytime performance characteristics than the fixed tree search used previously. Taken together, these contributions allow modern LMs to decompose and solve larger-scale reasoning tasks more effectively.