The high cost of model training makes it increasingly desirable to develop techniques for unlearning. These techniques seek to remove the influence of a training example without having to retrain the model from scratch. Intuitively, once a model has unlearned an example, an adversary that interacts with the model should no longer be able to tell whether the unlearned example was included in the model's training set or not. In the privacy literature, this is known as membership inference. In this work, we discuss adaptations of Membership Inference Attacks (MIAs) to the setting of unlearning (leading to their ``U-MIA'' counterparts). We propose a categorization of existing U-MIAs into ``population U-MIAs'', where the same attacker is instantiated for all examples, and ``per-example U-MIAs'', where a dedicated attacker is instantiated for each example. We show that the latter category, wherein the attacker tailors its membership prediction to each example under attack, is significantly stronger. Indeed, our results show that the U-MIAs commonly used in the unlearning literature overestimate the privacy protection afforded by existing unlearning techniques on both vision and language models. Our investigation reveals a large variance in the vulnerability of different examples to per-example U-MIAs. In fact, several unlearning algorithms reduce the vulnerability of some, but not all, of the examples we wish to unlearn, at the expense of increasing it for other examples. Notably, we find that the privacy protection for the remaining training examples may worsen as a consequence of unlearning. We also discuss the fundamental difficulty of equally protecting all examples using existing unlearning schemes, due to the different rates at which examples are unlearned. We demonstrate that naive attempts at tailoring unlearning stopping criteria to different examples fail to alleviate these issues.
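To make the population vs. per-example distinction concrete, the following is a minimal, hypothetical Python sketch of the general idea: a per-example U-MIA fits loss distributions for a single target example from shadow models (ones that trained on the example and then unlearned it, versus ones that never saw it) and scores membership with a likelihood ratio, whereas a population U-MIA applies one global threshold to every example. All names (`per_example_umia_score`, `in_losses`, `out_losses`, `population_umia_score`) are illustrative assumptions, not the exact attacks evaluated in the paper.

```python
# Illustrative sketch of a per-example (LiRA-style) U-MIA vs. a population U-MIA.
# `in_losses`: losses on the target example from shadow models that trained on it
#              and then ran the unlearning algorithm.
# `out_losses`: losses from shadow models that never included the example.
import numpy as np
from scipy.stats import norm


def per_example_umia_score(target_loss: float,
                           in_losses: np.ndarray,
                           out_losses: np.ndarray) -> float:
    """Higher score = more evidence the example was trained on (and then unlearned).
    The attack is tailored to this specific example via its own loss distributions."""
    mu_in, sigma_in = in_losses.mean(), in_losses.std() + 1e-8
    mu_out, sigma_out = out_losses.mean(), out_losses.std() + 1e-8
    # Log-likelihood ratio between the "member, then unlearned" and
    # "never a member" hypotheses, each modelled as a Gaussian over losses.
    log_p_in = norm.logpdf(target_loss, mu_in, sigma_in)
    log_p_out = norm.logpdf(target_loss, mu_out, sigma_out)
    return float(log_p_in - log_p_out)


def population_umia_score(target_loss: float, threshold: float) -> bool:
    """Population baseline: a single global loss threshold shared by all examples."""
    return target_loss < threshold
```

Under this framing, the per-example attack is stronger because its decision rule reflects how this particular example behaves under training and unlearning, rather than the average behaviour of the whole population; this is one plausible reading of why the paper finds population U-MIAs overestimate the privacy protection of unlearning.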