Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on the log-probability mismatch scales as $(1-p)$, where $p$ is the token probability. For high-probability tokens, this bound vanishes, so such tokens contribute negligibly to the sequence-level mismatch. For low-probability tokens in the tail, the bound remains large; moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically pruned ``safe'' vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.
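To make the pruning idea concrete, the following is a minimal PyTorch sketch of restricting a REINFORCE-style objective to a dynamically determined safe vocabulary. The probability-threshold criterion (`tail_prob`), the function names, and the tensor shapes are illustrative assumptions, not the paper's exact formulation; the abstract states only that the vocabulary is pruned dynamically to exclude the extreme tail.

\begin{verbatim}
import torch
import torch.nn.functional as F

def safe_vocabulary_mask(logits: torch.Tensor,
                         tail_prob: float = 1e-4) -> torch.Tensor:
    """Boolean mask of 'safe' tokens, pruning the extreme tail.

    Assumption: a token is 'safe' if its probability under the current
    policy is at least `tail_prob`; the threshold value is hypothetical.
    """
    probs = F.softmax(logits, dim=-1)        # (B, T, V)
    return probs >= tail_prob                # (B, T, V) bool

def masked_policy_gradient_loss(logits: torch.Tensor,
                                actions: torch.Tensor,
                                advantages: torch.Tensor,
                                tail_prob: float = 1e-4) -> torch.Tensor:
    """REINFORCE-style loss restricted to the safe vocabulary.

    Sampled tokens outside the safe set contribute no gradient,
    trading their large, systematically biased train/inference
    mismatch for a small, bounded optimization bias.
    """
    safe = safe_vocabulary_mask(logits, tail_prob)                   # (B, T, V)
    log_probs = F.log_softmax(logits, dim=-1)                        # (B, T, V)
    tok_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (B, T)
    tok_safe = safe.gather(-1, actions.unsqueeze(-1)).squeeze(-1)       # (B, T)
    pg = -(advantages * tok_logp)                                    # (B, T)
    # Zero out contributions from pruned (unsafe) sampled tokens.
    pg = torch.where(tok_safe, pg, torch.zeros_like(pg))
    return pg.mean()
\end{verbatim}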