Post-training, particularly reinforcement learning (RL) on self-play-generated data, has become a new learning paradigm for large language models (LLMs). However, scaling RL to develop a general reasoner remains a research challenge, as existing methods focus on task-specific reasoning without adequately addressing generalization across a broader range of tasks. Moreover, unlike traditional RL with a limited action space, LLMs operate in an effectively infinite action space, making it crucial to search for valuable and diverse strategies to solve problems effectively. To address this, we propose searching within the space of high-level abstract plans to enhance model generalization, and introduce Critical Plan Step Learning (CPL), comprising: 1) searching on plans, using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks, and 2) learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates advantage estimates for step preferences obtained via MCTS into Direct Preference Optimization (DPO). This combination helps the model effectively learn critical plan steps, enhancing both reasoning capabilities and generalization. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as HumanEval (+12.2%), GPQA (+8.6%), ARC-C (+4.0%), MMLU-STEM (+2.2%), and BBH (+1.8%).
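To make the Step-APO idea concrete, the sketch below shows one plausible way a DPO-style step-preference loss could incorporate MCTS advantage estimates as a margin: a larger advantage gap between the preferred and dispreferred plan steps demands a larger implicit-reward gap. This is an illustrative formulation under our own assumptions, not the paper's exact objective; all names (`step_apo_loss`, `adv_w`, `adv_l`) are hypothetical.

```python
import math

def step_apo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                  adv_w, adv_l, beta=0.1):
    """Illustrative step-level preference loss (hypothetical sketch).

    logp_w / logp_l:         policy log-probs of the preferred (w) and
                             dispreferred (l) plan step
    ref_logp_w / ref_logp_l: reference-model log-probs of the same steps
    adv_w / adv_l:           MCTS-derived advantage estimates for each step
    beta:                    DPO temperature
    """
    # Implicit rewards as in DPO: beta * log(pi / pi_ref)
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # Advantage gap acts as a margin: strongly-favored steps must be
    # separated by a larger implicit-reward gap (offset-DPO-style choice).
    margin = adv_w - adv_l
    x = (r_w - r_l) - margin
    # Negative log-sigmoid of the margin-shifted reward gap
    return -math.log(1.0 / (1.0 + math.exp(-x)))
```

Raising the policy's probability on the preferred step (increasing `logp_w`) lowers this loss, while a large MCTS advantage gap keeps pressure on the model until the implicit rewards reflect that gap.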