Reinforcement Learning (RL) for training Large Language Models (LLMs) is notoriously unstable. While recent studies attribute this instability to a "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies such as Importance Sampling might fail during extended training runs. In this work, we analyze the instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. We also find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, these observations suggest that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model's optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of following a pre-defined decay schedule as in traditional LR schedulers, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal of impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
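To make the triggering mechanism concrete, here is a minimal sketch of what a response-length-triggered LR decay could look like. The class name, window size, spike threshold, and decay factor are all illustrative assumptions, not details from the paper; the only idea taken from the abstract is that a spike in mean response length serves as the early-warning signal that triggers decay.

```python
from collections import deque


class LengthTriggeredLR:
    """Illustrative sketch (hypothetical): decay the learning rate when the
    mean response length spikes above its recent baseline, treating length
    growth as an early-warning signal of impending instability."""

    def __init__(self, base_lr, decay=0.5, spike_ratio=1.2, window=16, min_lr=1e-7):
        self.lr = base_lr
        self.decay = decay              # multiplicative LR decay on trigger (assumed value)
        self.spike_ratio = spike_ratio  # spike threshold relative to baseline (assumed value)
        self.history = deque(maxlen=window)
        self.min_lr = min_lr

    def step(self, mean_response_length):
        # Once the window is full, compare the current batch's mean response
        # length against the recent baseline; a spike triggers LR decay.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if mean_response_length > self.spike_ratio * baseline:
                self.lr = max(self.lr * self.decay, self.min_lr)
        self.history.append(mean_response_length)
        return self.lr
```

In contrast to a fixed cosine or step schedule, this scheduler leaves the LR untouched while response length is stable and only decays when the warning signal fires, which matches the abstract's framing of decay as a reactive rather than pre-scheduled intervention.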