Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyse this instability through token-level gradient dynamics, deriving a taxonomy that predicts how updates affect next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the sign of the advantage and on the token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. On mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Code is available at https://github.com/layer6ai-labs/wapo.
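To make the idea of updating only on positive-advantage completions concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes GRPO-style group-normalized advantages and a PPO-style clipped ratio, and it averages the surrogate over the tokens of "winner" completions only. The function names (`group_advantages`, `winner_only_clipped_loss`), the clipping range `clip_eps=0.2`, and the token-level normalization are all illustrative assumptions; the exact objective is defined in the paper and the linked repository.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for the completions of one prompt.

    Assumption: GRPO-style normalization (subtract group mean, divide by std).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def winner_only_clipped_loss(logp_new, logp_old, advantages, mask, clip_eps=0.2):
    """Clipped surrogate loss restricted to positive-advantage completions.

    logp_new, logp_old: (G, T) per-token log-probabilities under the current
        policy and the policy that generated the samples.
    advantages: (G,) one advantage per completion.
    mask: (G, T) with 1 for real tokens and 0 for padding.
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)

    ratio = np.exp(logp_new - logp_old)          # per-token importance ratio
    adv = advantages[:, None]                    # broadcast over tokens
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    surrogate = np.minimum(unclipped, clipped)   # standard clipped surrogate

    # Keep only tokens from completions with strictly positive advantage.
    winners = (advantages > 0).astype(np.float64)[:, None]
    token_mask = mask * winners

    n_tokens = token_mask.sum()
    if n_tokens == 0:
        return 0.0                               # no winners: no update signal
    return float(-(surrogate * token_mask).sum() / n_tokens)
```

In this sketch, completions with zero or negative advantage contribute nothing to the loss, so changing their log-probabilities leaves the value unchanged; a group in which every completion receives the same reward yields no update at all.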