Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields the tightest known guarantee across all divergence regimes. Crucially, all bounds depend on the maximum token-level divergence $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ (or $D_{\mathrm{TV}}^{\mathrm{tok,max}}$), a sequence-level quantity that cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences violating the trust region, enabling the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
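The masking rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-token log-probabilities of the sampled tokens under both policies are available, and estimates the token-level KL with the standard k3 estimator $(r - 1) - \log r$, where $r = \pi_\theta / \pi_{\text{roll}}$; the function name `trust_region_mask` and threshold parameter `delta` are hypothetical.

```python
import numpy as np

def trust_region_mask(logp_roll, logp_theta, pad_mask, delta):
    """Mask entire sequences whose max token-level KL exceeds delta.

    logp_roll, logp_theta: (B, T) log-probs of the sampled tokens under
        the rollout policy and the current policy.
    pad_mask: (B, T) with 1.0 for real tokens, 0.0 for padding.
    delta: trust-region threshold on D_KL^{tok,max}.
    Returns a (B,) array: 1.0 keeps the sequence, 0.0 masks it out.
    """
    # Per-token KL(pi_roll || pi_theta) via the nonnegative k3 estimator:
    # k3 = (r - 1) - log r, with r = exp(logp_theta - logp_roll).
    log_r = logp_theta - logp_roll
    kl_tok = (np.exp(log_r) - 1.0) - log_r
    kl_tok = np.where(pad_mask > 0, kl_tok, 0.0)
    # D_KL^{tok,max}: the maximum token-level divergence in each sequence.
    # PPO-style per-token clipping cannot bound this sequence-level quantity;
    # TRM instead drops the whole sequence when it exceeds the trust region.
    kl_max = kl_tok.max(axis=1)
    return (kl_max <= delta).astype(np.float32)
```

The mask multiplies the per-sequence surrogate loss, so a single far-off-policy token removes its entire sequence from the gradient, which is what keeps the sequence-level bound non-vacuous.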