Reinforcement Learning with Verifiable Rewards (RLVR) commonly relies on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are infeasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-but-correct trajectories while still containing mixed rewards, concentrating probability mass on common solutions. We derive the probability that an update misses rare-correct modes as a function of group size, showing non-monotonic behavior, and characterize how updates redistribute mass within the correct set, revealing that unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware advantage scaling coefficient, inspired by Focal loss, that down-weights updates on high-success prompts. This lightweight modification can be integrated directly into any group-relative RLVR algorithm, such as GRPO, DAPO, and CISPO. On Qwen2.5-7B across in-domain and out-of-domain benchmarks, our method improves pass@256 from 64.1 $\rightarrow$ 70.3 (GRPO), 69.3 $\rightarrow$ 72.5 (DAPO), and 73.2 $\rightarrow$ 76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.
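The difficulty-aware scaling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exponent `gamma` and the exact form of the coefficient, `(1 - p)^gamma` where `p` is the group's empirical success rate, are assumptions borrowed from the standard Focal loss modulating factor, applied on top of a GRPO-style group-relative advantage.

```python
import numpy as np

def group_relative_advantages(rewards):
    # GRPO-style group-relative advantage: each trajectory's reward minus
    # the group mean, normalized by the group standard deviation.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def focal_scaled_advantages(rewards, gamma=2.0):
    # Hypothetical difficulty-aware scaling: down-weight prompts whose group
    # success rate p is high, via a Focal-loss-style factor (1 - p)^gamma.
    # With binary verifiable rewards, p is the fraction of correct samples.
    # gamma=2.0 and this exact functional form are illustrative assumptions.
    r = np.asarray(rewards, dtype=float)
    p = r.mean()
    return (1.0 - p) ** gamma * group_relative_advantages(r)
```

Under this sketch, an easy prompt (e.g. 3 of 4 samples correct) receives much smaller advantage magnitudes than a hard prompt (1 of 4 correct), so gradient updates concentrate on prompts where rare-correct trajectories still matter.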