Large language models (LLMs) typically employ sampling or beam search, accompanied by prompts such as Chain-of-Thought (CoT), to boost reasoning and decoding ability. Recent methods such as Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aim to augment the reasoning capabilities of LLMs by using tree-search algorithms to guide multi-step reasoning. These methods focus mainly on LLMs' reasoning ability during inference and rely heavily on human-designed prompts to make the LLM act as a value function, an approach that lacks general applicability and scalability. To address these limitations, we present an AlphaZero-like tree-search framework for LLMs (termed TS-LLM), systematically illustrating how tree search with a learned value function can guide LLM decoding. TS-LLM distinguishes itself in two key ways: (1) By leveraging a learned value function, our approach can be applied generally to tasks beyond reasoning (such as RLHF alignment) and to LLMs of any size, without prompting advanced, large-scale models. (2) It can guide LLMs' decoding during both inference and training. Empirical evaluations across reasoning, planning, and RLHF alignment tasks validate the effectiveness of TS-LLM, even on trees with a depth of 64.
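To make the idea of value-guided tree-search decoding concrete, the following is a minimal sketch of an AlphaZero-style search loop on a toy problem. It is not the TS-LLM implementation: the two-symbol action set, the `TARGET` string, the uniform `policy_prior`, the hand-written `value_fn` (standing in for a learned value function), the PUCT constant, and the simulation budget are all hypothetical choices made purely for illustration.

```python
import math

# Toy setup (all hypothetical): decode a 4-symbol sequence, one symbol per step.
TARGET = "abba"
ACTIONS = ["a", "b"]

def policy_prior(state):
    """Stand-in for the LLM policy: a uniform prior over next symbols."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def value_fn(state):
    """Stand-in for a learned value function: fraction of TARGET matched so far."""
    matches = sum(1 for s, t in zip(state, TARGET) if s == t)
    return matches / len(TARGET)

def is_terminal(state):
    return len(state) >= len(TARGET)

class Node:
    def __init__(self, state, prior):
        self.state = state
        self.prior = prior
        self.children = {}    # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def q(self):
        # Mean backed-up value; 0 for unvisited nodes.
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.0):
    """PUCT rule: mean value plus a prior-weighted exploration bonus."""
    def score(child):
        bonus = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return child.q() + bonus
    return max(node.children.values(), key=score)

def simulate(root):
    """One simulation: select down to a leaf, expand it, evaluate, back up."""
    path = [root]
    node = root
    while node.children:
        node = select_child(node)
        path.append(node)
    if not is_terminal(node.state):
        for action, prior in policy_prior(node.state).items():
            node.children[action] = Node(node.state + action, prior)
    value = value_fn(node.state)  # no rollout: the value function scores the leaf
    for visited in path:
        visited.visits += 1
        visited.value_sum += value

def tree_search_decode(num_simulations=50):
    """Decode step by step, committing to the most-visited child each time."""
    state = ""
    while not is_terminal(state):
        root = Node(state, prior=1.0)
        for _ in range(num_simulations):
            simulate(root)
        best = max(root.children.values(), key=lambda child: child.visits)
        state = best.state
    return state

if __name__ == "__main__":
    print(tree_search_decode())
```

In a real system, `policy_prior` would presumably come from the LLM's next-step distribution and `value_fn` from a trained value network. The sketch shows only how the two combine inside the search, with the value estimate replacing prompt-based self-evaluation.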