Auditing Machine Unlearning: A Systematic Research on Whether Models Truly Forget

Machine unlearning has been extensively studied in response to growing privacy concerns and regulatory requirements. However, auditing whether an unlearning algorithm has truly erased the influence of specific data remains an open challenge. Without reliable and practical auditing mechanisms, critical privacy risks such as residual information leakage can go undetected. This paper initiates a systematic investigation into whether existing unlearning algorithms truly forget the designated data. We propose the first practical and general-purpose auditing framework for machine unlearning, inspired by the concept of proof of ignorance. Our framework addresses the key practicality limitations of existing methods: it eliminates the need for retraining-from-scratch baselines, avoids training large numbers of shadow models, and requires no intrusive intervention in the original training process. To evaluate the framework, we first conduct validation experiments to verify its soundness and completeness. We then perform comprehensive experiments across six datasets and ten representative unlearning methods. The results demonstrate that our framework reliably distinguishes between successful and failed unlearning. In particular, we observe that retraining-based and fine-tuning-based methods can achieve effective unlearning, even when the target data remain in the original dataset. In contrast, de-optimization-based methods fail to achieve true unlearning and instead degrade the model's performance. Fisher/Hessian-based methods also fail to unlearn the requested data, even when formal certification is provided. Moreover, we show that our framework is robust against fake unlearning attempts and generalizes well to large language models.
