We revisit exploration collapse in reinforcement learning with verifiable rewards (RLVR) from the perspective of the \emph{candidate distribution} for next-token prediction. We formally show that as probability concentrates on the top-$1$ candidate, the expected number of distinct responses collapses to one regardless of the sampling budget $K$. This theoretical result is corroborated by empirical tracking of top-$N$ candidate probabilities during training: the top-$1$ candidate progressively dominates while plausible alternatives are suppressed. These findings suggest a key desideratum for effective exploration: \emph{preserving non-negligible probability mass on the top-$N$ candidates}. To this end, we propose Candidate-aware Support Preservation (CaSP), which combines two complementary designs. For correct responses, CaSP redistributes positive gradients among the top-$N$ candidates; for incorrect responses, it applies a stronger penalty to the top-$1$ candidate. Unlike many exploration-oriented methods that improve pass@$K$ at the cost of pass@$1$, CaSP improves pass@$K$ across the full $K$ spectrum. These gains generalize to 6 math, 2 logical-reasoning, and 2 coding benchmarks, and scale to 32B-parameter models and sampling budgets up to $K=1024$, positioning CaSP as a principled, candidate-level approach to RLVR exploration.
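The collapse claim can be illustrated numerically. The sketch below is a toy model, not the paper's formal setting: it assumes every position shares the same next-token distribution, with probability `top1` on one candidate and the remainder spread uniformly over the others, and that tokens are sampled independently. The function names, vocabulary size, and response length are hypothetical choices for illustration. It relies only on the standard identity $\mathbb{E}[D] = \sum_y \bigl(1 - (1 - p(y))^K\bigr)$ for the expected number $D$ of distinct responses among $K$ i.i.d. samples, which follows from linearity of expectation.

```python
from itertools import product
from math import prod


def token_dist(top1: float, vocab: int) -> list[float]:
    """Toy next-token distribution (an assumption for illustration):
    `top1` on the first candidate, the rest spread uniformly."""
    rest = (1.0 - top1) / (vocab - 1)
    return [top1] + [rest] * (vocab - 1)


def expected_distinct(top1: float, vocab: int, length: int, k: int) -> float:
    """Exact expected number of distinct responses among k i.i.d. samples.

    Enumerates all vocab**length responses y and sums 1 - (1 - p(y))**k,
    i.e. the probability that y appears at least once among the k samples.
    """
    dist = token_dist(top1, vocab)
    total = 0.0
    for tokens in product(range(vocab), repeat=length):
        p = prod(dist[t] for t in tokens)  # probability of this response
        total += 1.0 - (1.0 - p) ** k
    return total


if __name__ == "__main__":
    # As top-1 mass approaches 1, the expected count approaches 1 for fixed k.
    for top1 in (0.5, 0.9, 0.99, 0.999):
        value = expected_distinct(top1, vocab=4, length=4, k=64)
        print(f"top1={top1}: E[distinct] = {value:.3f}")
```

In this toy setting the expected count is also bounded above by $1 + K(1 - p^\star)$, where $p^\star$ is the probability of the all-top-$1$ response, since each sample that differs from that response adds at most one new distinct response. For any fixed $K$ the bound tends to one as the top-$1$ mass tends to one, so a larger sampling budget alone does not restore diversity once the candidate distribution has concentrated.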