Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving LLM reasoning. However, existing PPO-style trust-region mechanisms are position-agnostic: they apply the same threshold to every token and evaluate each token independently. This pointwise treatment conflicts with autoregressive generation in two ways. First, uniform thresholds ignore autoregressive asymmetry. A deviation at an early position affects every subsequent token and so compounds into sequence-level drift; a static threshold therefore under-regulates early divergence while over-constraining late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift: each token receives the same divergence allowance regardless of how far its conditioning history has already deviated from the rollout policy. To address these limitations, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound through two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions, whose effects persist longer, and relaxes them for late-stage tokens. Second, a cumulative prefix budget tracks the deviation accumulated so far and dynamically restricts further token-level deviation, preventing errors from compounding along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
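The two coupled mechanisms can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the divergence proxy (absolute log-ratio), the linear threshold schedule, the rule that every token's divergence is charged to the budget, and the names `cppo_mask_sketch`, `base_eps`, and `budget`. It should not be read as the authors' implementation.

```python
import numpy as np


def cppo_mask_sketch(logp_new, logp_old, base_eps=0.2, budget=1.0):
    """Hypothetical illustration of a position-weighted, prefix-budgeted token mask.

    Returns an integer array with 1 where a token would contribute to the
    policy update and 0 where it would be masked out.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    num_tokens = logp_new.shape[0]

    # Assumed per-token divergence proxy: absolute log-probability ratio
    # between the current policy and the rollout policy.
    divergence = np.abs(logp_new - logp_old)

    # Assumed position-weighted threshold: a linear schedule that is
    # strictest at the first token and reaches base_eps at the last one.
    positions = np.arange(1, num_tokens + 1)
    thresholds = base_eps * positions / num_tokens

    mask = np.zeros(num_tokens, dtype=int)
    spent = 0.0  # divergence accumulated along the prefix so far
    for t in range(num_tokens):
        # The allowance at position t is capped both by the position
        # threshold and by what remains of the cumulative prefix budget.
        remaining = max(budget - spent, 0.0)
        allowed = min(thresholds[t], remaining)
        if divergence[t] <= allowed:
            mask[t] = 1
        # Assumption: every token's divergence is charged to the budget,
        # whether or not the token itself was masked.
        spent += divergence[t]
    return mask


if __name__ == "__main__":
    logp_old = np.array([-1.0, -1.0, -1.0, -1.0])
    # Same per-token deviation everywhere, with a generous budget:
    # only the position-weighted threshold decides.
    print(cppo_mask_sketch(logp_old + 0.08, logp_old, base_eps=0.2, budget=10.0))
    # Smaller deviation, but a tight budget that runs out mid-sequence.
    print(cppo_mask_sketch(logp_old + 0.04, logp_old, base_eps=0.2, budget=0.1))
```

In the first call, the identical deviation is masked at the earliest position and admitted at later ones, which mirrors the asymmetry argument. In the second, late tokens are masked once the prefix has used up its budget, even though their individual deviation is below the position threshold.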