Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT) and attributes this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. Our code is available at https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.