While numerous machine unlearning (MU) methods have recently been developed with promising results in erasing the influence of forgotten data, classes, or concepts, they are also highly vulnerable: for example, simple fine-tuning can inadvertently reintroduce erased concepts. In this paper, we address this contradiction by examining the internal representations of unlearned models, in contrast to prior work that focuses primarily on output-level behavior. Our analysis shows that many state-of-the-art MU methods appear successful mainly because of a misalignment between last-layer features and the classifier, a phenomenon we call feature-classifier misalignment. In fact, the hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Assuming neural collapse in the original model, we further demonstrate that adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, and we corroborate this with experiments using classifier-only fine-tuning. Motivated by these findings, we propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces forgotten information in representations while maintaining high retain accuracy, highlighting the need for faithful representation-level evaluation of MU.
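To make the distinction between output-level and representation-level forgetting concrete, the following is a minimal NumPy sketch on synthetic data. It is not the paper's implementation: the function names `class_mean_features` and `cmf_predict`, the nearest-mean (Euclidean) decision rule, the `keep_classes` masking, and the toy 2-D features are all assumptions made for illustration. It shows how changing only the classifier can drive forget accuracy to zero while the features themselves stay fully separable.

```python
import numpy as np


def class_mean_features(features, labels, num_classes):
    """Return a (num_classes, dim) array holding the mean feature of each class."""
    means = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            means[c] = features[mask].mean(axis=0)
    return means


def cmf_predict(features, means, keep_classes=None):
    """Predict the class whose mean feature is nearest (assumed Euclidean rule).

    keep_classes (hypothetical option): if given, only these classes may be
    predicted, which mimics a classifier-only change that drops a class.
    """
    # Squared distance from every sample to every class mean: shape (n, C).
    d2 = ((features[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    if keep_classes is not None:
        masked = np.full_like(d2, np.inf)
        masked[:, keep_classes] = d2[:, keep_classes]
        d2 = masked
    return d2.argmin(axis=1)


# Toy, well-separated "last-layer features" for three classes (synthetic).
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]])
labels = np.repeat(np.arange(3), 50)
features = centers[labels] + 0.1 * rng.standard_normal((150, 2))

means = class_mean_features(features, labels, num_classes=3)

# Classifier-only "unlearning" of class 2: the features are left untouched.
forget = labels == 2
masked_preds = cmf_predict(features, means, keep_classes=[0, 1])
forget_acc = (masked_preds[forget] == 2).mean()
retain_acc = (masked_preds[~forget] == labels[~forget]).mean()

# A fresh class-mean probe on the same features recovers the forgotten class.
probe_acc = (cmf_predict(features, means) == labels).mean()

print(forget_acc, retain_acc, probe_acc)
```

In this toy setting the masked classifier never predicts the forgotten class, yet a probe refit on the unchanged features classifies it perfectly, which is the kind of gap a representation-level evaluation is meant to expose.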