STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. Our analysis shows that the magnitude of token-wise policy gradients in RL is negatively correlated with both token probability and local policy entropy. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01\%, which we term \emph{spurious tokens}. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. To mitigate this instability, we design the Silencing Spurious Tokens (S2T) mechanism, which efficiently identifies spurious tokens by three characteristic signals (low probability, low entropy, and positive advantage) and then suppresses their gradient perturbations during optimization. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves average performance improvements of 7.13\% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.69\% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.
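
As a rough illustration of the two ideas above, the following Python sketch shows (i) why a low-probability sampled token yields a large gradient under a softmax policy, and (ii) how a mask could silence tokens that are simultaneously low-probability, low-entropy, and positively rewarded. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the thresholds `prob_max` and `entropy_max`, and the choice to zero the gradient weight outright are all hypothetical, since the abstract does not specify them.

```python
import math

def logit_grad_norm(probs, token):
    """L2 norm of d log pi(token) / d logits for a softmax policy.

    For softmax, this gradient is onehot(token) - probs, so the norm
    grows as the sampled token's probability shrinks.
    """
    return math.sqrt(sum(
        ((1.0 if i == token else 0.0) - p) ** 2
        for i, p in enumerate(probs)
    ))

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def s2t_mask(token_probs, entropies, advantages,
             prob_max=0.01, entropy_max=0.5):
    """Return a per-token weight: 0.0 silences a token, 1.0 keeps it.

    A token is flagged as spurious only if all three signals hold:
    low probability, low local entropy, and positive advantage.
    Both thresholds are hypothetical placeholders.
    """
    mask = []
    for p, h, a in zip(token_probs, entropies, advantages):
        spurious = p < prob_max and h < entropy_max and a > 0.0
        mask.append(0.0 if spurious else 1.0)
    return mask

# A confident (low-entropy) distribution over a toy 3-token vocabulary.
probs = [0.995, 0.004, 0.001]
rare_norm = logit_grad_norm(probs, token=1)    # sampled the rare token
common_norm = logit_grad_norm(probs, token=0)  # sampled the likely token

# Three tokens from a response with positive advantage: only the first
# one is both rare and sampled from a low-entropy distribution.
mask = s2t_mask(
    token_probs=[0.004, 0.995, 0.004],
    entropies=[entropy(probs), entropy(probs), 2.0],
    advantages=[1.0, 1.0, 1.0],
)
```

In a training loop, such a mask would multiply the per-token loss terms of the group-based objective before averaging, so that flagged tokens contribute no gradient while all other tokens are left unchanged.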
