An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations

While numerous machine unlearning (MU) methods have recently been developed with promising results in erasing the influence of forgotten data, classes, or concepts, they are also highly vulnerable: for example, simple fine-tuning can inadvertently reintroduce erased concepts. In this paper, we address this contradiction by examining the internal representations of unlearned models, in contrast to prior work that focuses primarily on output-level behavior. Our analysis shows that many state-of-the-art MU methods appear successful mainly because of a misalignment between last-layer features and the classifier, a phenomenon we call feature-classifier misalignment. In fact, the hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Assuming neural collapse in the original model, we further demonstrate that adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, and we corroborate this with experiments using classifier-only fine-tuning. Motivated by these findings, we propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces forgotten information in representations while maintaining high retain accuracy, highlighting the need for faithful representation-level evaluation of MU.
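To make the idea of a class-mean features classifier more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that a CMF classifier assigns each sample to the class whose mean last-layer feature vector is nearest in Euclidean distance; the abstract does not specify the exact rule. The function names `class_means` and `cmf_predict` and the synthetic Gaussian "features" are hypothetical and used for illustration only.

```python
import random


def class_means(features, labels, num_classes):
    """Compute the mean feature vector of each class.

    Assumes every class has at least one sample.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for j, value in enumerate(x):
            sums[y][j] += value
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]


def cmf_predict(x, means):
    """Predict the class whose mean is closest to x (assumed decision rule)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(means)), key=lambda c: sq_dist(x, means[c]))


# Synthetic stand-in for last-layer features: one Gaussian cluster per class.
random.seed(0)
num_classes, dim, per_class = 3, 8, 50
centers = [[random.gauss(0, 5) for _ in range(dim)] for _ in range(num_classes)]
features, labels = [], []
for c in range(num_classes):
    for _ in range(per_class):
        features.append([m + random.gauss(0, 1) for m in centers[c]])
        labels.append(c)

means = class_means(features, labels, num_classes)
preds = [cmf_predict(x, means) for x in features]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"toy CMF accuracy: {accuracy:.3f}")
```

Because such a classifier is derived directly from the features, it cannot drift out of alignment with them the way a separately trained linear head can, which is the property the abstract attributes to CMF-based unlearning.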

