Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling-probability dynamics, showing that the standard objective disproportionately reinforces the highest-likelihood paths and thereby suppresses valid alternative reasoning chains. To address this, we propose ProGRPO, which introduces a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate confidence levels across all correct responses. By incorporating prompt perplexity and answer confidence into advantage estimation, ARM dynamically reshapes the reward signal to attenuate gradient updates for over-confident reasoning paths while redistributing probability mass toward under-explored correct solutions. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse, enhancing generative diversity and response entropy while maintaining competitive accuracy, and thus achieving a superior trade-off between exploration and exploitation in reasoning tasks. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability to generate diverse correct reasoning paths.
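The re-weighting idea in the abstract can be illustrated with a minimal sketch. The code below is not the paper's implementation: it assumes GRPO-style group-normalized advantages, measures "answer confidence" as the exponentiated mean token log-probability of each sampled response, and uses a hypothetical exponent `alpha` to attenuate the advantage of confident correct responses; the prompt-perplexity term mentioned in the abstract is omitted for brevity.

```python
import math

def reweighted_advantages(rewards, mean_token_logprobs, alpha=1.0):
    """Hedged sketch of confidence-based advantage re-weighting.

    rewards: verifiable reward per sampled response (e.g. 1 correct, 0 wrong).
    mean_token_logprobs: mean per-token log-probability of each response
        under the current policy (a proxy for answer confidence).
    alpha: hypothetical strength of the attenuation (not from the paper).
    """
    # Standard GRPO-style group-normalized advantages.
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    adv = [(r - mean) / std for r in rewards]

    # Confidence in [0, 1]: geometric-mean token probability of the response.
    conf = [math.exp(lp) for lp in mean_token_logprobs]

    # Attenuate positive advantages of over-confident correct responses,
    # shifting relative gradient mass toward under-explored correct paths.
    out = []
    for a, c, r in zip(adv, conf, rewards):
        if r > 0 and a > 0:
            a = a * (1.0 - c) ** alpha
        out.append(a)
    return out
```

Under this sketch, two correct responses with the same raw reward receive different effective advantages: the one the policy already assigns high probability is damped, while the lower-likelihood correct path keeps more of its advantage, which is the equilibration behavior the abstract describes.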