Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). The two methods optimize different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models widely alternate SFT and RL training, yet there is no theoretical account of whether the two stages can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: starting from an SFT-optimal model, RL training increases the SFT loss; and (2) RL-then-SFT coupling: subsequent SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated during post-training without loss of prior performance.
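The contrast between the two objectives can be sketched on a toy categorical "next-token" distribution. The numbers, the expert token index, and the verifier rewards below are hypothetical illustrations, not from the paper; the point is only that the SFT loss and the expected reward pull the distribution in different directions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical model logits over a 3-token vocabulary.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# SFT objective: minimize cross-entropy against the expert response.
# Here we assume the expert token is index 0.
expert_token = 0
sft_loss = -math.log(probs[expert_token])

# RL objective: maximize expected reward. Assume a rule-based verifier
# that rewards only token 2.
rewards = [0.0, 0.0, 1.0]
expected_reward = sum(p * r for p, r in zip(probs, rewards))

# Lowering sft_loss concentrates probability mass on token 0, while
# raising expected_reward shifts mass toward token 2 -- improving one
# objective necessarily degrades the other in this toy setting.
```

Under these assumptions, any update that increases the probability of the expert token drains mass from the rewarded token and vice versa, which is the one-step intuition behind the two coupling results stated above.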