Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization (PTQ) can reverse machine unlearning. We show that this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause. Across all baselines, per-parameter updates are 47 to 828 times smaller than the NF4 quantization bin width, and updates diffused across billions of parameters cannot cross quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both failure modes by combining three components: causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor that guarantees quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric that distinguishes structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy four properties, with margin on each: meaningful forgetting, retain preservation, a non-positive PTQ gap, and structural erasure. Gradient-based baselines, by contrast, recover up to +0.05 accuracy under compression.
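The bin-width argument can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the method itself: the codebook is a hypothetical set of 16 evenly spaced levels standing in for NF4 (whose real levels are non-uniform), the block scale is held fixed at the original absmax, and the weight distribution, the 1/100 update size, and the count of 32 edited weights are arbitrary choices.

```python
import random

# Hypothetical stand-in codebook: 16 evenly spaced levels in [-1, 1].
# Real NF4 uses non-uniform levels; this only illustrates the rounding effect.
LEVELS = [-1.0 + 2.0 * k / 15 for k in range(16)]
STEP = 2.0 / 15  # spacing between adjacent levels, in normalized units


def quantize(weights, scale):
    """Return, for each weight, the index of the nearest level after scaling."""
    return [min(range(16), key=lambda k: abs(w / scale - LEVELS[k]))
            for w in weights]


def changed_codes(weights, updated, scale):
    """Count positions whose 4-bit code differs between two weight vectors."""
    before = quantize(weights, scale)
    after = quantize(updated, scale)
    return sum(b != a for b, a in zip(before, after))


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
scale = max(abs(w) for w in weights)  # absmax scale, held fixed (assumption)
bin_width = STEP * scale              # bin width in weight units

# Diffuse update: every weight moves by 1/100 of a bin width, random sign.
diffuse = [w + random.choice((-1, 1)) * bin_width / 100 for w in weights]

# Sparse update: 32 weights move one full bin width toward zero.
edited = set(random.sample(range(len(weights)), 32))
sparse = [w - (bin_width if w > 0 else -bin_width) if i in edited else w
          for i, w in enumerate(weights)]

diffuse_changed = changed_codes(weights, diffuse, scale)
sparse_changed = changed_codes(weights, sparse, scale)
print(f"diffuse: {diffuse_changed} of {len(weights)} codes changed")
print(f"sparse:  {sparse_changed} of {len(edited)} edited codes changed")
```

In this toy setting, only a small fraction of the diffusely updated weights change their quantized code, so most of the update is rounded away. An edit of a full bin width changes the code of each weight it touches, which is the intuition behind a per-parameter magnitude floor.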