Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly stand in contrast to outcome reward models (ORMs), which assign a single reward to an entire trajectory. In this work, however, we prove that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent, under mild assumptions, to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM. Leveraging this GRPO-as-a-PRM framework, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder exploration and exploitation under different conditions. We propose a simple modification of the algorithm, $\lambda$-GRPO, that mitigates this defect, and show that LLMs tuned with $\lambda$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that the hidden, built-in PRM structure within the vanilla GRPO algorithm can be leveraged to boost model performance without employing an explicit PRM, and with negligible impact on training time and cost.
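For context, standard GRPO (as introduced in prior work) assigns each of the $G$ responses sampled for a prompt a group-normalized advantage computed from its scalar outcome reward, and this single scalar is then broadcast uniformly to every token of that response; this uniform broadcast is the implicit per-step credit assignment that the equivalence result concerns. A minimal sketch of that advantage, in standard notation (the precise form of the $\lambda$-GRPO modification is not specified in this abstract):
\[
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)},
\qquad i = 1,\dots,G,
\]
where $r_i$ is the ORM reward of the $i$-th sampled response and $\hat{A}_i$ is applied identically at every token position of that response.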