Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
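The indifference described above can be illustrated on a toy discrete policy. The following is a minimal sketch, not the UCPO implementation: the function names, the toy distributions, and the choice of a KL-to-uniform term as the "uniformity penalty" are all assumptions made for illustration, and the exact form of the penalty used by UCPO may differ. It shows that two policies placing the same total mass on the correct set receive the same expected binary reward, while a measure of uniformity within the correct set tells them apart.

```python
import math

def expected_reward(policy, correct):
    """Expected binary reward: total probability mass on correct responses."""
    return sum(p for y, p in policy.items() if y in correct)

def conditional_entropy(policy, correct):
    """Shannon entropy (nats) of the policy renormalized over the correct set."""
    mass = expected_reward(policy, correct)
    if mass <= 0.0:
        raise ValueError("policy places no mass on the correct set")
    probs = [policy[y] / mass for y in correct if policy.get(y, 0.0) > 0.0]
    return -sum(q * math.log(q) for q in probs)

def kl_to_uniform(policy, correct):
    """Hypothetical penalty: KL(pi(. | correct) || Uniform(correct)) = log|C| - H."""
    return math.log(len(correct)) - conditional_entropy(policy, correct)

# Toy response space: three correct answers and one incorrect answer.
correct = {"a", "b", "c"}
collapsed = {"a": 0.58, "b": 0.01, "c": 0.01, "wrong": 0.40}
uniform = {"a": 0.20, "b": 0.20, "c": 0.20, "wrong": 0.40}

for name, policy in [("collapsed", collapsed), ("uniform", uniform)]:
    print(name,
          "reward:", round(expected_reward(policy, correct), 3),
          "penalty:", round(kl_to_uniform(policy, correct), 3))
```

Both toy policies earn the same expected reward, so a reward-only objective has no preference between them; under this assumed penalty, only the policy that spreads its mass evenly over the correct answers incurs (approximately) zero cost.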