Understanding Diversity Collapse in RLVR via the Lens of Overtraining

Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from \emph{diversity collapse}: Pass@$1$ improves while high-$k$ Pass@$k$ degrades, which is commonly interpreted as a narrowing of the model's reasoning boundary. We formalize this diversity collapse through the lens of \emph{overtraining}: once a problem's contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-$k$ Pass@$k$, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also bears on the question of whether RLVR can expand the model's reasoning abilities beyond the base model: since RLVR is structurally biased against high-$k$ Pass@$k$, an aggregate decline in that metric does not by itself mean that no new reasoning gains occurred. Two lines of evidence support this. Interventionally, restricting updates to problems with zero observed success lifts Pass@$256$ above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, we propose \emph{Bayesian Boundary Gating} (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@$k$ across a wide range of $k$.
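
The saturation claim can be illustrated with a small numerical sketch. It is not the BBG estimator, and it rests on assumptions that are not taken from the paper: samples are treated as i.i.d. with a per-sample success probability $p$, so that Pass@$k = 1 - (1 - p)^k$; $p$ is replaced by the plug-in estimate (observed successes divided by rollouts); and the rollout budget of 8 is a hypothetical value. The function names `pass_at_k` and `marginal_gain` are likewise illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Pass@k under the assumption of k i.i.d. samples, each correct with probability p."""
    return 1.0 - (1.0 - p) ** k


def marginal_gain(p: float, k: int) -> float:
    """Derivative of Pass@k with respect to p, i.e. k * (1 - p)^(k - 1).

    A value near zero means that raising p further barely moves Pass@k.
    """
    return k * (1.0 - p) ** (k - 1)


N_ROLLOUTS = 8  # hypothetical rollout budget per problem (assumption)
K = 256         # the high-k regime discussed in the text

if __name__ == "__main__":
    for successes in range(0, 3):
        p_hat = successes / N_ROLLOUTS  # plug-in estimate of p (assumption)
        print(
            f"successes={successes}  p_hat={p_hat:.3f}  "
            f"Pass@1={pass_at_k(p_hat, 1):.3f}  "
            f"Pass@{K}={pass_at_k(p_hat, K):.6f}  "
            f"dPass@{K}/dp={marginal_gain(p_hat, K):.3e}"
        )
```

Under these assumptions, a problem with one success in 8 rollouts already has a plug-in Pass@$256$ that is numerically indistinguishable from 1, and the derivative with respect to $p$ is vanishingly small, whereas Pass@$1$ for the same problem is far from saturated. A problem with zero observed successes is the only case in this sketch where the high-$k$ metric still has room to move.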
