Supervised fine-tuning (SFT) induces new behaviors in large language models, yet it imposes no structural constraint on how these behaviors are distributed within the model. Existing behavior interpretation methods, such as circuit attribution approaches, identify sparse subnetworks correlated with SFT-induced behaviors post hoc. However, such correlations do not imply *causal necessity*, which limits the ability to selectively control SFT-induced behaviors at inference time. We pursue an alternative by asking: can an SFT-induced behavior be deliberately compressed into a sparse, mechanistically necessary subnetwork, termed a *carrier*, while remaining controllable at inference time without weight modification? We propose (a) **Loss-Constrained Dual Descent (LCDD)**, which constructs such carriers by jointly optimizing routing masks and model weights under an explicit utility budget, and (b) **SFT-Eraser**, a soft prompt that is optimized via activation matching on the extracted carrier channels and reverses the SFT-induced behavior. Across safety, fixed-response, and style behaviors on multiple model families, LCDD yields sparse carriers that preserve the target behaviors while enabling strong reversion when triggered by SFT-Eraser. Ablations further establish that the sparse structure is the key precondition for reversal: the same trigger optimization fails on standard SFT models, confirming that structure rather than trigger design is the operative factor. These results provide direct evidence that the learned carriers are causally necessary for the behaviors, pointing to a new direction for systematically localizing and selectively suppressing SFT-induced behaviors in deployed models.
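
To make the idea of optimizing masks "under an explicit utility budget" concrete, the sketch below runs a generic projected primal-dual (Lagrangian) update on a toy problem. It is not the LCDD algorithm itself, whose details are not given here. Everything in it is an assumption made for illustration: the function name `fit_sparse_gates`, the per-channel `importance` weights, the quadratic stand-in loss, and the step sizes. Model weights are omitted entirely; only the gates are optimized.

```python
import numpy as np


def fit_sparse_gates(importance, budget, steps=2000, lr_gate=0.05, lr_dual=1.0):
    """Toy primal-dual loop: minimize sum(gates) subject to loss(gates) <= budget.

    Assumed stand-in loss: sum_j importance[j] * (1 - gates[j])**2, i.e. closing
    an important channel hurts the behavior more than closing an unimportant one.
    """
    importance = np.asarray(importance, dtype=float)
    gates = np.ones_like(importance)  # start with every channel fully open
    lam = 0.0                         # Lagrange multiplier for the loss budget

    for _ in range(steps):
        loss = float(np.sum(importance * (1.0 - gates) ** 2))
        # Gradient of the Lagrangian  sum(gates) + lam * (loss - budget)
        # with respect to each gate.
        grad = 1.0 - 2.0 * lam * importance * (1.0 - gates)
        # Primal step: descend on the gates, projected back onto [0, 1].
        gates = np.clip(gates - lr_gate * grad, 0.0, 1.0)
        # Dual step: raise lam while the budget is violated, relax it otherwise,
        # and keep it non-negative.
        lam = max(0.0, lam + lr_dual * (loss - budget))

    final_loss = float(np.sum(importance * (1.0 - gates) ** 2))
    return gates, lam, final_loss


if __name__ == "__main__":
    gates, lam, loss = fit_sparse_gates([4.0, 1.0, 0.0, 0.0], budget=0.05)
    print("gates:", np.round(gates, 3))
    print("multiplier:", round(lam, 3), "loss:", round(loss, 4))
```

For this toy instance the KKT conditions put the constrained optimum at gates of roughly (0.95, 0.8, 0, 0) with a multiplier of about 2.5, so the loop should settle near those values: the two channels that matter stay mostly open, the two that do not are closed, and the loss sits at the budget. The multiplier takes the place of a hand-tuned sparsity weight, which is the usual reason for preferring a constrained formulation over a fixed penalty.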