Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and a speedup of over 1.6$\times$. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
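To make the two-stage idea concrete, below is a minimal Python sketch of a LENS-style training step under the assumptions stated in the abstract: sample rollouts from a purified prompt, keep those that pass the verifiable reward check, then use them to supervise the policy on the original noisy prompt. Every helper here (`purify_prompt`, `sample_rollouts`, `verify_answer`, `supervised_update`) is a hypothetical placeholder introduced for illustration, not the paper's actual API; the real purification criterion and optimization objective are not specified in this abstract.

```python
# Hypothetical sketch of a LENS-style step; all helpers are illustrative stubs.

def purify_prompt(prompt: str) -> str:
    # Stage 1 (assumed): identify and strip interference tokens.
    # A real system would score tokens and drop the few that derail
    # exploration; this stub simply passes the prompt through.
    return prompt

def sample_rollouts(policy, prompt: str, n: int) -> list[str]:
    # Placeholder: draw n reasoning rollouts from the current policy.
    return [policy(prompt) for _ in range(n)]

def verify_answer(rollout: str, answer: str) -> bool:
    # Verifiable reward: a simple exact-match check on the final answer.
    return rollout.strip().endswith(answer)

def supervised_update(policy, prompt: str, rollout: str) -> None:
    # Placeholder: one gradient step imitating the successful rollout,
    # conditioned on the ORIGINAL noisy prompt (the transfer stage).
    pass

def lens_step(policy, noisy_prompt: str, answer: str, budget: int = 8) -> float:
    """One LENS-style step: purify, roll out, then transfer supervision."""
    clean_prompt = purify_prompt(noisy_prompt)            # purification
    rollouts = sample_rollouts(policy, clean_prompt, budget)
    successes = [r for r in rollouts if verify_answer(r, answer)]
    for r in successes:
        supervised_update(policy, noisy_prompt, r)        # learn on noisy prompt
    return len(successes) / budget                        # sampling success rate
```

The design point the sketch tries to capture is the asymmetry in the abstract: exploration happens on the purified prompt, where rollouts are more likely to succeed within the limited budget, but the learning signal is applied to the original noisy prompt, so the deployed model never depends on purification at inference time.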