Machine unlearning, an emerging research topic motivated by data privacy regulations, enables trained models to remove information learned from specific data. Many existing methods address this problem indirectly by intentionally injecting incorrect supervision, but such interventions can drastically and unpredictably alter decision boundaries and feature spaces, leading to training instability and undesired side effects. To approach the task more fundamentally, we first analyze how latent feature spaces change between the original and retrained models, and observe that the feature representations of samples not involved in training align closely with the feature manifolds of samples previously seen during training. Based on these findings, we introduce a novel evaluation metric for machine unlearning, termed dimensional alignment, which measures the alignment between the eigenspaces of the forget-set and retain-set samples. We employ this metric as a regularization loss to build a robust and stable unlearning framework, which is further enhanced by a self-distillation loss and an alternating training scheme. Our framework effectively eliminates information from the forget set while preserving knowledge from the retain set. Finally, we identify critical flaws in established evaluation metrics for machine unlearning and introduce new evaluation tools that more accurately reflect its fundamental goals.
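To make the dimensional-alignment idea concrete, the following is a minimal sketch of an eigenspace-alignment score: the fraction of (centered) forget-set feature energy that lies in the top-k principal subspace of the retain-set features. The function name, the choice of k, and the energy-ratio formulation are illustrative assumptions; the paper's exact definition of the metric may differ.

```python
import numpy as np

def dimensional_alignment(feat_forget, feat_retain, k=5):
    """Fraction of forget-set feature energy captured by the top-k
    eigenspace of the retain-set features. Returns a value in [0, 1];
    values near 1 indicate the forget-set representations stay close
    to the retain set's dominant feature directions.

    NOTE: illustrative sketch only -- the paper's exact formulation
    of dimensional alignment may differ.
    """
    # Center both feature matrices (rows are samples, columns are dims)
    R = feat_retain - feat_retain.mean(axis=0, keepdims=True)
    F = feat_forget - feat_forget.mean(axis=0, keepdims=True)
    # Top-k right singular vectors span the retain set's leading eigenspace
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    basis = Vt[:k]                      # (k, d), orthonormal rows
    proj = F @ basis.T                  # coordinates of F in that subspace
    return float(np.sum(proj**2) / np.sum(F**2))
```

Used as a regularizer, a differentiable variant of this score could be maximized for retain-set samples and minimized for forget-set samples, steering their representations toward or away from the retained eigenspace without injecting incorrect labels.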