Crafting Reversible SFT Behaviors in Large Language Models

Supervised fine-tuning (SFT) induces new behaviors in large language models, yet imposes no structural constraint on how these behaviors are distributed within the model. Existing behavior interpretation methods, such as circuit attribution approaches, identify sparse subnetworks correlated with SFT-induced behaviors post-hoc. However, such correlations do not imply *causal necessity*, limiting the ability to selectively control SFT-induced behaviors at inference time. We pursue an alternative by asking: can an SFT-induced behavior be deliberately compressed into a sparse, mechanistically necessary subnetwork, termed a *carrier*, while remaining controllable at inference time without weight modification? We propose (a) **Loss-Constrained Dual Descent (LCDD)**, which constructs such carriers by jointly optimizing routing masks and model weights under an explicit utility budget, and (b) **SFT-Eraser**, a soft prompt optimized via activation matching on extracted carrier channels, to reverse the SFT-induced behavior. Across safety, fixed-response, and style behaviors on multiple model families, LCDD yields sparse carriers that preserve target behaviors while enabling strong reversion when triggered by SFT-Eraser. Ablations further establish that the sparse structure is the key precondition for reversal: the same trigger optimization fails on standard SFT models, confirming that structure rather than trigger design is the operative factor. These results provide direct evidence that the learned carriers are causally necessary for the behaviors, pointing to a new direction for systematically localizing and selectively suppressing SFT-induced behaviors in deployed models.
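The abstract does not spell out LCDD's update rules, so the following is only a toy sketch of the general idea it names: trading sparsity against a loss budget by alternating primal descent on masks and weights with dual ascent on a Lagrange multiplier. Everything here is an assumption made for illustration, including the function name `primal_dual_masking`, the sigmoid gate parameterization, the per-channel squared-error loss, and all hyperparameters; it is not the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def primal_dual_masking(targets, budget=0.01, steps=2000, lr=0.1, lr_dual=0.05):
    """Toy problem: minimize sum(masks) subject to task_loss <= budget.

    Channel i outputs mask_i * weight_i and should reproduce targets[i].
    Lagrangian: sum(masks) + lam * (task_loss - budget).
    All names and values are hypothetical, chosen for illustration only.
    """
    n = len(targets)
    weights = [0.0] * n   # trainable weights, one per channel
    gates = [2.0] * n     # gate logits; mask_i = sigmoid(gate_i)
    lam = 0.0             # dual variable for the loss budget
    for _ in range(steps):
        masks = [sigmoid(g) for g in gates]
        residuals = [m * w - t for m, w, t in zip(masks, weights, targets)]
        loss = sum(r * r for r in residuals) / n
        for i in range(n):
            dloss_dout = 2.0 * residuals[i] / n
            # Primal step on the weight: only the loss term depends on it.
            grad_w = lam * dloss_dout * masks[i]
            # Primal step on the gate: sparsity term plus weighted loss term.
            slope = masks[i] * (1.0 - masks[i])
            grad_g = slope * (1.0 + lam * dloss_dout * weights[i])
            weights[i] -= lr * grad_w
            gates[i] -= lr * grad_g
        # Dual ascent: raise lam while the loss exceeds the budget,
        # lower it (never below zero) once the budget is met.
        lam = max(0.0, lam + lr_dual * (loss - budget))
    return [sigmoid(g) for g in gates], weights, lam


if __name__ == "__main__":
    # Only channel 0 has a non-zero target, so only it needs to stay open.
    masks, weights, lam = primal_dual_masking([2.0, 0.0, 0.0, 0.0])
    print("masks:", [round(m, 3) for m in masks])
    print("lambda:", round(lam, 3))
```

In this toy setup the channels with a zero target receive no loss gradient, so the sparsity term steadily closes their gates, while the multiplier decides how much pressure the remaining channel gets to stay within the budget. A real carrier construction would operate on transformer channels and an SFT loss rather than scalar targets.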

