Inexact Unlearning Needs More Careful Evaluations to Avoid a False Sense of Privacy

The high cost of model training makes it increasingly desirable to develop techniques for unlearning. These techniques seek to remove the influence of a training example without retraining the model from scratch. Intuitively, once a model has unlearned an example, an adversary that interacts with the model should no longer be able to tell whether that example was included in the model's training set. In the privacy literature, making this determination is known as membership inference. In this work, we discuss adaptations of Membership Inference Attacks (MIAs) to the setting of unlearning, which lead to their "U-MIA" counterparts. We propose a categorization of existing U-MIAs into "population U-MIAs", where the same attacker is instantiated for all examples, and "per-example U-MIAs", where a dedicated attacker is instantiated for each example. We show that the latter category, wherein the attacker tailors its membership prediction to each example under attack, is significantly stronger. Indeed, our results show that the U-MIAs commonly used in the unlearning literature overestimate the privacy protection afforded by existing unlearning techniques on both vision and language models. Our investigation reveals a large variance in the vulnerability of different examples to per-example U-MIAs. In fact, several unlearning algorithms reduce the vulnerability of some, but not all, of the examples that we wish to unlearn, at the expense of increasing it for other examples. Notably, we find that the privacy protection for the remaining training examples may worsen as a consequence of unlearning. We also discuss the fundamental difficulty of protecting all examples equally with existing unlearning schemes, which stems from the different rates at which examples are unlearned. We demonstrate that naive attempts at tailoring unlearning stopping criteria to different examples fail to alleviate these issues.
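To make the distinction between the two attacker categories concrete, the following is a minimal toy sketch, not the attack evaluated in this work. It assumes synthetic Gaussian losses, a constant residual loss gap left by imperfect unlearning, and access to shadow models for calibration. All function names and parameters (`simulate_losses`, `population_attack`, `per_example_attack`, `gap`, `noise`) are hypothetical. The population attacker fits one loss threshold shared by every example, while the per-example attacker fits a dedicated threshold for each example.

```python
import numpy as np


def simulate_losses(n_examples=200, n_models=64, gap=0.5, noise=0.3, seed=0):
    """Toy losses (an assumption): rows are examples, columns are models."""
    rng = np.random.default_rng(seed)
    # Baseline loss differs widely between examples (easy vs. hard examples).
    difficulty = rng.normal(2.0, 1.0, size=(n_examples, 1))
    # "Out" world: the model was retrained without the example.
    out = difficulty + rng.normal(0.0, noise, size=(n_examples, n_models))
    # "Unlearned" world: the example was trained on, then imperfectly
    # unlearned, leaving a residual loss gap (hypothetical, constant here).
    unl = difficulty - gap + rng.normal(0.0, noise, size=(n_examples, n_models))
    return unl, out


def balanced_accuracy(unl_target, out_target, threshold):
    """Predict 'was a member' when the loss falls below the threshold."""
    true_pos = (unl_target < threshold).mean()
    true_neg = (out_target >= threshold).mean()
    return 0.5 * (true_pos + true_neg)


def population_attack(unl_shadow, out_shadow, unl_target, out_target):
    """Same attacker for all examples: one threshold fitted on pooled losses."""
    pooled = np.concatenate([unl_shadow.ravel(), out_shadow.ravel()])
    candidates = np.quantile(pooled, np.linspace(0.01, 0.99, 99))
    scores = [balanced_accuracy(unl_shadow, out_shadow, t) for t in candidates]
    best = candidates[int(np.argmax(scores))]
    return balanced_accuracy(unl_target, out_target, best)


def per_example_attack(unl_shadow, out_shadow, unl_target, out_target):
    """Dedicated attacker per example: midpoint of its two shadow means.

    For two Gaussians with equal variance, this midpoint is the
    likelihood-ratio decision boundary.
    """
    thresholds = 0.5 * (
        unl_shadow.mean(axis=1, keepdims=True)
        + out_shadow.mean(axis=1, keepdims=True)
    )
    return balanced_accuracy(unl_target, out_target, thresholds)


if __name__ == "__main__":
    unl, out = simulate_losses()
    half = unl.shape[1] // 2  # first half calibrates, second half is attacked
    args = (unl[:, :half], out[:, :half], unl[:, half:], out[:, half:])
    print("population U-MIA balanced accuracy: ", population_attack(*args))
    print("per-example U-MIA balanced accuracy:", per_example_attack(*args))
```

In this toy setting the spread in per-example difficulty is larger than the residual gap, so a single shared threshold mostly separates easy examples from hard ones rather than unlearned examples from never-seen ones. A per-example threshold removes that confound, which illustrates why an evaluation based only on a population attacker can overestimate the privacy protection.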
