Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet the theoretical properties of this estimator remain poorly understood. In this work, we uncover a fundamental issue in group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and the model's training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
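To make the object of the analysis concrete: GRPO's group-relative advantage standardizes rewards within each group of sampled responses, assigning response i the estimate (r_i - mean(r)) / (std(r) + eps). Below is a minimal Monte-Carlo sketch, assuming binary verifier rewards and an illustrative group size of G = 8, that compares the advantage this estimator assigns to a correct response with the true expected advantage 1 - p for prompts of varying pass rate p. It only shows how the estimator and the true advantage can be compared empirically; the paper's formal bias characterization rests on its own assumptions, which this toy setting does not claim to reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
G = 8        # group size: responses sampled per prompt (illustrative value)
EPS = 1e-6   # numerical stabilizer in the standardized estimator

def group_relative_advantage(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO-style estimator: standardize rewards within one group."""
    return (rewards - rewards.mean()) / (rewards.std() + EPS)

# Monte-Carlo probe: for a prompt whose true pass rate is p, compare the
# group-relative advantage assigned to a correct response against the true
# expected advantage of a correct response under binary rewards, 1 - p.
# The gap between the two averages is the estimation bias in this toy setting.
for p in (0.1, 0.5, 0.9):  # hard, medium, and easy prompts
    estimates = []
    for _ in range(50_000):
        rewards = rng.binomial(1, p, size=G).astype(float)
        adv = group_relative_advantage(rewards)  # zero when all rewards agree
        estimates.extend(adv[rewards == 1.0])    # advantages of correct answers
    print(f"p={p}: mean estimated advantage of a correct answer "
          f"= {np.mean(estimates):.3f}  (true advantage 1-p = {1 - p:.3f})")
```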
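The abstract does not spell out HA-DW's construction, so the sketch below is purely hypothetical: one plausible reading of "an evolving difficulty anchor" is a per-prompt exponential moving average of observed pass rates, used to scale group-relative advantages up for hard prompts (whose advantages the analysis finds underestimated) and down for easy ones. Every name and functional form here (DifficultyAnchor, alpha, reweight, the linear weight) is our assumption for illustration, not the paper's method.

```python
import numpy as np
from collections import defaultdict

class DifficultyAnchor:
    """Hypothetical evolving difficulty anchor: an exponential moving average
    of each prompt's observed pass rate, updated as training progresses."""
    def __init__(self, alpha: float = 0.9, prior: float = 0.5):
        self.alpha = alpha                           # EMA smoothing (assumed)
        self.pass_rate = defaultdict(lambda: prior)  # per-prompt difficulty

    def update(self, prompt_id: str, rewards: np.ndarray) -> float:
        ema = self.alpha * self.pass_rate[prompt_id] + (1 - self.alpha) * rewards.mean()
        self.pass_rate[prompt_id] = ema
        return ema

def reweight(adv: np.ndarray, pass_rate: float, strength: float = 1.0) -> np.ndarray:
    """Scale advantages up for hard prompts (low pass rate) and down for easy
    ones, counteracting the difficulty-dependent bias the abstract describes.
    The linear form of the weight is an assumption, not the paper's."""
    weight = 1.0 + strength * (0.5 - pass_rate)  # >1 when hard, <1 when easy
    return weight * adv

# Usage: inside a GRPO-style update, after computing group-relative advantages.
anchor = DifficultyAnchor()
rewards = np.array([0.0, 1.0, 0.0, 0.0])  # one sampled group, binary rewards
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
ema = anchor.update("prompt-42", rewards)  # "prompt-42" is a placeholder key
print(reweight(adv, ema))
```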