Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target-task performance. To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables it to keep prior knowledge intact while learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
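To make the mode-seeking versus mass-covering contrast concrete, the sketch below gives one standard formalization of the simplified mixture setting; the notation (p_prior, p_task, alpha, q_theta, p*) is assumed for illustration and is not taken from the paper. Here p is the pre-trained LM viewed as a two-component mixture, q_theta is the post-trained LM, and p* is the distribution that post-training drives q_theta toward (the target-task data for SFT, or a reward-tilted distribution for on-policy RL).

% Minimal, compilable sketch; all notation below is assumed for illustration
% and does not reproduce the paper's exact equations.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}

% The pre-trained LM is viewed as a mixture of a prior-knowledge component
% and a target-task component:
\[
  p(x) \;=\; \alpha\, p_{\mathrm{prior}}(x) \;+\; (1-\alpha)\, p_{\mathrm{task}}(x),
  \qquad \alpha \in (0,1).
\]

% SFT fits q_theta by maximum likelihood on fixed (off-policy) data, which
% corresponds to minimizing the forward, mass-covering KL divergence and
% forces q_theta to spread mass over all modes of p*:
\[
  \min_{\theta}\; \mathrm{KL}\bigl(p^{\ast} \,\|\, q_{\theta}\bigr)
  \;=\; \min_{\theta}\; \mathbb{E}_{x \sim p^{\ast}}
        \Bigl[\log \tfrac{p^{\ast}(x)}{q_{\theta}(x)}\Bigr].
\]

% Training on on-policy samples instead behaves like minimizing the reverse,
% mode-seeking KL divergence, which allows q_theta to concentrate on a single
% mode of p* while leaving the prior-knowledge component undisturbed:
\[
  \min_{\theta}\; \mathrm{KL}\bigl(q_{\theta} \,\|\, p^{\ast}\bigr)
  \;=\; \min_{\theta}\; \mathbb{E}_{x \sim q_{\theta}}
        \Bigl[\log \tfrac{q_{\theta}(x)}{p^{\ast}(x)}\Bigr].
\]

\end{document}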