Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRMs) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring access to their reward functions) or suffer from intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that alternately updates the policy and the PRM. Our learning algorithm incorporates customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework unifies both online and offline PRM learning methods, justifying that rePIRL learns PRMs under minimal assumptions. Empirical evaluations on standard math and coding reasoning benchmarks demonstrate the effectiveness of rePIRL over existing methods. We further demonstrate applications of our trained PRM in test-time training, test-time scaling, and providing an early signal when training on hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.
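To make the alternating scheme concrete, the following is a minimal sketch of an IRL-style dual learning loop in which a discriminator-like PRM and a policy are updated in turn. This is an illustration of the general pattern only, not rePIRL's actual objectives or models; all names here (toy_policy, toy_prm, sample_steps, expert_steps, STEP_DIM) are hypothetical placeholders.

```python
# A minimal sketch (assumption: a discriminator-style IRL objective; rePIRL's
# real losses, models, and data pipeline differ) of alternating PRM/policy
# updates. All names below are hypothetical placeholders.
import torch
import torch.nn as nn

STEP_DIM = 16  # toy embedding size for one reasoning step (assumption)

toy_policy = nn.Linear(STEP_DIM, STEP_DIM)       # stand-in for the LLM policy
toy_prm = nn.Linear(STEP_DIM, 1)                 # stand-in for the PRM head
opt_pi = torch.optim.Adam(toy_policy.parameters(), lr=1e-3)
opt_r = torch.optim.Adam(toy_prm.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def sample_steps(batch=8):
    """Placeholder: step embeddings produced by the current policy."""
    return toy_policy(torch.randn(batch, STEP_DIM))

def expert_steps(batch=8):
    """Placeholder: step embeddings from expert demonstrations."""
    return torch.randn(batch, STEP_DIM) + 1.0

for it in range(100):
    # (1) PRM update: raise PRM scores on expert steps, lower them on
    #     policy steps (discriminator-style reward learning).
    pol = sample_steps().detach()  # detach: no gradient into the policy here
    exp = expert_steps()
    loss_r = bce(toy_prm(exp), torch.ones(exp.size(0), 1)) + \
             bce(toy_prm(pol), torch.zeros(pol.size(0), 1))
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

    # (2) Policy update: maximize the PRM's process reward on policy steps
    #     (a stand-in for RL fine-tuning with per-step rewards).
    loss_pi = -toy_prm(sample_steps()).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()
```

In this pattern the PRM plays the role of the learned reward in inverse RL, so no reward function of the expert is ever assumed; only expert trajectories are needed.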