Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. We observe that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal a fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
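To make the dual-metric idea concrete, the following is a minimal, hypothetical sketch of how surface-level leakage and latent trace persistence could be combined into one verdict. All names (`surface_leakage`, `latent_persistence`, `dual_metric_report`), the linear-probe scoring, and the thresholds are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a dual-metric unlearning check: a model counts as
# truly erased only if surface outputs no longer leak AND a probe finds no
# residual subject signal in hidden activations. Thresholds are assumptions.

def surface_leakage(outputs, forbidden_terms):
    """Fraction of model outputs that still mention any forbidden term."""
    leaked = sum(any(t in o.lower() for t in forbidden_terms) for o in outputs)
    return leaked / len(outputs)

def latent_persistence(activations, probe_weights, threshold=0.5):
    """Fraction of hidden-state vectors a linear probe still classifies as
    carrying the erased subject (dot-product score above threshold)."""
    def score(vec):
        return sum(w * x for w, x in zip(probe_weights, vec))
    hits = sum(score(v) > threshold for v in activations)
    return hits / len(activations)

def dual_metric_report(outputs, forbidden_terms, activations, probe_weights):
    fq = 1.0 - surface_leakage(outputs, forbidden_terms)    # forget quality
    trace = latent_persistence(activations, probe_weights)  # latent residue
    # High FQ alone is ambiguous: with a high latent trace it indicates
    # obfuscation (suppressed behavior, retained knowledge), not erasure.
    verdict = "erased" if fq > 0.95 and trace < 0.1 else "obfuscated"
    return {"FQ": fq, "latent_trace": trace, "verdict": verdict}
```

The key design point the sketch illustrates is that the two metrics are read jointly: a refusal-only defense can score a perfect FQ while the latent probe still fires, which is exactly the failure mode the protocol is meant to expose.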