Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution

Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization (PTQ) can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates are 47x to 828x smaller than the NF4 quantization bin width. Updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both failure modes by combining three components: causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure) with margin on each, while gradient-based baselines recover up to +0.05 accuracy under compression.
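
The bin-width argument can be illustrated with a toy experiment. The sketch below is not MANSU and does not reproduce the paper's measurements; it only shows why an update far below the quantization bin width rarely changes a weight's 4-bit code, while an update of a few bin widths almost always does. Everything in it is an assumption made for illustration: the NF4 codebook is rounded to four decimals, blocks of 64 weights are scaled by their absolute maximum, the weights are synthetic Gaussian values, and the names `nf4_codes` and `flipped_fraction` are hypothetical helpers.

```python
import numpy as np

# Approximate NF4 codebook, rounded to 4 decimals (assumption: values as
# published with QLoRA; real kernels store them at higher precision).
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])
BLOCK = 64  # assumed block size for absmax scaling


def nf4_codes(w):
    """Return the index of the nearest NF4 level for each weight,
    after scaling each block of BLOCK weights by its absolute maximum."""
    blocks = w.reshape(-1, BLOCK)
    scale = np.abs(blocks).max(axis=1, keepdims=True)
    normed = blocks / scale
    codes = np.abs(normed[..., None] - NF4_LEVELS).argmin(axis=-1)
    return codes.reshape(w.shape)


def flipped_fraction(w, delta, touched):
    """Fraction of the touched weights whose 4-bit code changes after the update."""
    changed = nf4_codes(w) != nf4_codes(w + delta)
    return float(changed[touched].mean())


rng = np.random.default_rng(0)
n = BLOCK * 2048
w = rng.normal(0.0, 0.02, size=n)        # synthetic weights, not a real model
sign = rng.choice([-1.0, 1.0], size=n)   # random update direction

# A representative bin width in weight units: median gap between adjacent
# levels times the median per-block scale.
block_scale = np.abs(w.reshape(-1, BLOCK)).max(axis=1)
bin_width = np.median(np.diff(NF4_LEVELS)) * np.median(block_scale)

# Diffuse update: every weight moves by 1/100 of a bin width.
diffuse = (bin_width / 100.0) * sign
diffuse_flips = flipped_fraction(w, diffuse, np.ones(n, dtype=bool))

# Sparse update: 2000 weights move by 4 bin widths, the rest stay put.
idx = rng.choice(n, size=2000, replace=False)
touched = np.zeros(n, dtype=bool)
touched[idx] = True
sparse = np.zeros(n)
sparse[idx] = 4.0 * bin_width * sign[idx]
sparse_flips = flipped_fraction(w, sparse, touched)

print(f"diffuse update: {diffuse_flips:.1%} of touched codes change")
print(f"sparse update:  {sparse_flips:.1%} of touched codes change")
```

In this toy setting, a weight whose code does not change is stored identically before and after the update, so the update is erased by quantization. The weights that do flip under the diffuse update are mostly those that already sat next to a bin boundary.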

