Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches such as Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and from vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address these issues, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy's accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics against a global prior updated via Welford's online algorithm. Theoretically, we demonstrate that, compared to GRPO, EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving strong performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.
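The shrinkage mechanism lends itself to a compact illustration. Below is a minimal sketch, assuming a simple group-size-dependent shrinkage weight `kappa / (kappa + G)`; the parameter `kappa`, this particular weighting rule, and the blended variance scale are illustrative assumptions for exposition, not the paper's exact formulation. The global prior is maintained with Welford's online algorithm, as stated above.

```python
import numpy as np


class WelfordTracker:
    """Running global mean/variance of rewards via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        # Population variance of all rewards seen so far.
        return self.m2 / self.n if self.n > 1 else 0.0


def shrunk_advantages(rewards, tracker, kappa=8.0, eps=1e-6):
    """Advantages with a shrinkage baseline: the local group mean is pulled
    toward the global running mean. `kappa` (hypothetical) controls how
    strongly small groups lean on the global prior."""
    rewards = np.asarray(rewards, dtype=float)
    g = len(rewards)
    local_mean = rewards.mean()
    # Shrinkage weight: smaller groups borrow more strength from the prior.
    w = kappa / (kappa + g)
    baseline = (1.0 - w) * local_mean + w * tracker.mean
    # Blend local and global variance so the scale never collapses to zero,
    # preserving a non-vanishing signal when all rewards in the group tie.
    scale = np.sqrt((1.0 - w) * rewards.var() + w * tracker.var + eps)
    # Fold this group into the global prior only after it served as baseline.
    for r in rewards:
        tracker.update(r)
    return (rewards - baseline) / scale
```

In the degenerate all-failure case (every reward in the group is zero), a pure group baseline yields zero advantages, whereas the shrunk baseline stays anchored to the positive global mean, so each response still receives a negative penalty signal:

```python
tracker = WelfordTracker()
for r in [1.0, 0.0, 1.0, 1.0]:  # earlier rollouts seed the global prior
    tracker.update(r)
print(shrunk_advantages([0.0, 0.0, 0.0, 0.0], tracker))  # all ~ -1.41, not zero
```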