Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods are easily reversed by fine-tuning or few-shot prompting, suggesting that their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, which makes unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
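The phrase "collapsing top principal components of weight gradients before each update" can be illustrated with a small NumPy sketch. This is only one plausible reading of that sentence, not the authors' implementation: it assumes the principal components are the leading singular directions of a single weight-gradient matrix, and the function name `collapse_top_components`, the parameter `k`, the toy shapes, and the learning rate are all hypothetical.

```python
import numpy as np


def collapse_top_components(grad, k=1):
    """Return `grad` with its k largest singular directions zeroed out.

    Assumption: "principal components" is read here as the leading
    singular directions of one gradient matrix. The abstract does not
    say how the components are actually computed.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0  # drop the dominant directions
    return (u * s) @ vt  # rebuild the gradient from what is left


rng = np.random.default_rng(0)

# Toy gradient: a strong rank-1 direction plus a weak residual.
dominant = 10.0 * np.outer(rng.normal(size=8), rng.normal(size=6))
residual = 0.1 * rng.normal(size=(8, 6))
grad = dominant + residual

filtered = collapse_top_components(grad, k=1)

# Hypothetical update step that uses the filtered gradient.
weights = rng.normal(size=(8, 6))
weights -= 0.01 * filtered
```

In this toy setup the filtered gradient keeps the shape of the original but loses its single strongest direction, so the update is driven by the remaining, lower-variance directions. How many components to collapse, and from which gradients they are estimated, are details the abstract does not give.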