Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient, real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., pass-at-k-style training), while remaining orthogonal to them and thus composable for additional gains.
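To make the mechanism concrete, the following is a minimal sketch (not the authors' implementation) of the two core ingredients named above: Gaussian parameter-space noise applied to the policy before rollout generation, and truncated importance-sampling weights that correct the mismatch between the perturbed sampling policy and the policy being updated. The toy policy, the helper names `perturbed_copy` and `truncated_is_weights`, the clipping threshold, and the placeholder GRPO-style objective are all illustrative assumptions; the adaptive noise scheduler and the semantic-diversity/self-certainty surrogate are not shown.

```python
# Minimal sketch of parameter-space noise + truncated importance sampling (TIS)
# for policy-gradient updates. Helper names and the toy policy are hypothetical.
import copy
import torch
import torch.nn as nn

def perturbed_copy(policy: nn.Module, sigma: float) -> nn.Module:
    """Return a copy of the policy with i.i.d. Gaussian noise added to every parameter."""
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy

def truncated_is_weights(logp_current: torch.Tensor,
                         logp_sampling: torch.Tensor,
                         clip: float = 2.0) -> torch.Tensor:
    """Importance ratios pi_theta / pi_sampling, truncated at `clip` to correct
    the sampling-update mismatch introduced by the perturbed sampling policy."""
    ratio = torch.exp(logp_current - logp_sampling)
    return torch.clamp(ratio, max=clip)

# Toy usage: a linear "policy" over a 10-token vocabulary.
torch.manual_seed(0)
policy = nn.Linear(16, 10)
sigma = 0.02                      # noise scale; an adaptive scheduler would update this
states = torch.randn(4, 16)       # stand-in for prompt representations

# 1) Sample rollouts from the perturbed policy (trajectory-level exploration).
sampler = perturbed_copy(policy, sigma)
with torch.no_grad():
    sample_logits = sampler(states)
    actions = torch.distributions.Categorical(logits=sample_logits).sample()
    logp_sampling = torch.log_softmax(sample_logits, dim=-1).gather(
        1, actions.unsqueeze(1)).squeeze(1)

# 2) Score the same actions under the unperturbed policy and form TIS weights.
logits = policy(states)
logp_current = torch.log_softmax(logits, dim=-1).gather(
    1, actions.unsqueeze(1)).squeeze(1)
weights = truncated_is_weights(logp_current, logp_sampling)

# 3) Placeholder GRPO-style objective: advantage-weighted log-likelihood,
#    reweighted by the truncated importance ratios.
advantages = torch.randn(4)       # stand-in for group-normalized verifiable rewards
loss = -(weights.detach() * advantages * logp_current).mean()
loss.backward()
```

In this sketch the noise is drawn once per rollout batch, so the entire trajectory is generated under a single perturbed policy; this is what yields temporally consistent exploration, in contrast to action-space noise injected independently at each decoding step.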