Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Policy gradient methods for large language models (LLMs) optimize a policy $π_θ$ via a surrogate objective computed from samples of a rollout policy $π_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences, such as backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness. These factors cause an off-policy mismatch ($π_{\text{roll}} \neq π_θ$), which introduces approximation error between the surrogate and the true objective. We show that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive two new bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. We further derive an Adaptive bound that strictly generalizes the Pinsker-Marginal bound by combining an importance-ratio decomposition of the error with an adaptive per-position application of Pinsker's inequality to the future trajectory divergence; the minimum over all three bounds is at least as tight as any individual bound. Crucially, all bounds depend on $D_{\mathrm{KL}}^{\mathrm{tok,max}}$, the maximum token-level KL divergence across the sequence. Because this is a \emph{sequence-level} quantity, it cannot be controlled by existing token-independent methods such as PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences that violate the trust region. TRM enables the first non-vacuous monotonic improvement guarantees and demonstrates empirical training stability for long-horizon LLM-RL.
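To make the sequence-level nature of the masking rule concrete, the following is a minimal sketch of how such a mask might be computed. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify how the token-level KL is estimated, what threshold is used, or how the masked loss is normalized, so the names `trust_region_mask`, `masked_surrogate`, and `delta`, as well as the choice to average over kept sequences only, are hypothetical.

```python
from typing import List, Sequence


def trust_region_mask(
    token_kl: Sequence[Sequence[float]],
    delta: float,
) -> List[bool]:
    """Return one keep/drop flag per sequence.

    token_kl[i][t] is assumed to be an estimate of the KL divergence between
    the rollout policy and the current policy at position t of sequence i.
    Rows may have different lengths (no padding). delta is a hypothetical
    trust-region threshold; the source does not give a value.
    """
    keep = []
    for per_token in token_kl:
        # Sequence-level statistic: the worst token decides for the whole sequence.
        max_kl = max(per_token, default=0.0)
        keep.append(max_kl <= delta)
    return keep


def masked_surrogate(seq_losses: Sequence[float], keep: Sequence[bool]) -> float:
    """Average per-sequence surrogate losses over the kept sequences only.

    Normalizing by the number of kept sequences is an assumption made here
    for illustration; other normalizations are possible.
    """
    kept = [loss for loss, k in zip(seq_losses, keep) if k]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    kl = [[0.01, 0.02], [0.01, 0.50, 0.00]]
    keep = trust_region_mask(kl, delta=0.05)
    print(keep, masked_surrogate([1.0, 100.0], keep))
```

In this sketch a single token that exceeds the threshold removes the entire sequence from the update. Per-token clipping, by contrast, would still let the remaining tokens of that sequence contribute to the gradient.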
