Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields the tightest known guarantee across all divergence regimes. Crucially, all bounds depend on the maximum token-level divergence $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ (or $D_{\mathrm{TV}}^{\mathrm{tok,max}}$), a sequence-level quantity that cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences violating the trust region, enabling the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
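As a rough illustration of the sequence-level masking idea, the following minimal sketch drops a whole sequence when its largest per-position KL divergence exceeds a threshold. The function names, the threshold `delta`, the KL direction (rollout policy to current policy), and the toy distributions are all assumptions made for illustration; the abstract does not specify the exact masking criterion, so this should not be read as the authors' implementation.

```python
import math

def token_kl(p, q, eps=1e-12):
    """KL(p || q) between two next-token distributions (lists of probabilities)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def max_token_kl(roll_dists, policy_dists):
    """Largest per-position KL along one sequence (assumed direction: roll -> policy)."""
    return max(token_kl(p, q) for p, q in zip(roll_dists, policy_dists))

def trust_region_mask(batch_roll, batch_policy, delta):
    """Return 1.0 to keep an entire sequence, 0.0 to mask it.

    `delta` is a hypothetical trust-region threshold on the max token-level KL.
    """
    return [1.0 if max_token_kl(r, p) <= delta else 0.0
            for r, p in zip(batch_roll, batch_policy)]

# Toy batch: two sequences of length 2 over a 2-token vocabulary.
roll = [
    [[0.5, 0.5], [0.9, 0.1]],  # sequence A
    [[0.5, 0.5], [0.9, 0.1]],  # sequence B
]
policy = [
    [[0.5, 0.5], [0.9, 0.1]],  # A: identical to rollout, so KL is 0 everywhere
    [[0.5, 0.5], [0.1, 0.9]],  # B: one position diverges sharply
]

mask = trust_region_mask(roll, policy, delta=0.05)
print(mask)
```

In this toy batch, sequence B is masked in full even though only one of its positions diverges. That is the sense in which the criterion is sequence-level: a per-token rule would only affect the offending position, leaving the maximum over the sequence uncontrolled.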