Machine unlearning has emerged as a prominent and challenging research area, driven in large part by rising regulatory demands for industries to delete user data upon request and by heightened awareness of privacy. Existing approaches either retrain models from scratch or apply several finetuning steps for every deletion request, and they are often constrained by limited computational resources and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, our algorithm first estimates the Retain Space and the Forget Space, which represent the feature or activation spaces for samples from the classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel singular value decomposition-based technique that requires layer-wise collection of network activations from a few forward passes through the network. We then compute the shared information between these spaces and remove it from the Forget Space to isolate the class-discriminatory feature space for unlearning. Finally, we project the model weights in the direction orthogonal to the class-discriminatory space to obtain the unlearned model. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer, with only a $\sim$1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Further, our algorithm consistently performs well when subjected to Membership Inference Attacks, showing a 7.8% average improvement over other baselines across a variety of image classification datasets and network architectures, while being $\sim$6x more computationally efficient.
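
As a rough illustration of the pipeline described above, the following NumPy sketch applies the four steps (estimate the two spaces via SVD, remove the shared information, isolate the class-discriminatory space, project the weights) to a single linear layer. The abstract does not state the exact formulas, so everything specific here is an assumption made for illustration: the function names, the `energy` threshold used to truncate the SVD, the `tol` cutoff, and the particular way shared information is removed (subtracting the retain-space projection of the forget basis and re-orthonormalizing). It should not be read as the paper's definition of these steps.

```python
import numpy as np


def activation_basis(acts, energy=0.99):
    """Orthonormal basis of the dominant subspace of `acts` (features x samples).

    `energy` is a hypothetical threshold: keep the fewest left singular
    vectors whose squared singular values reach this fraction of the total.
    """
    U, S, _ = np.linalg.svd(acts, full_matrices=False)
    cum = np.cumsum(S**2) / np.sum(S**2)
    k = min(int(np.searchsorted(cum, energy)) + 1, len(S))
    return U[:, :k]


def discriminatory_basis(U_retain, U_forget, tol=1e-6):
    """Assumed construction: strip from the forget basis whatever lies in the
    retain space, then re-orthonormalize what is left."""
    residual = U_forget - U_retain @ (U_retain.T @ U_forget)
    Q, S, _ = np.linalg.svd(residual, full_matrices=False)
    return Q[:, S > tol]  # drop directions that were (numerically) fully shared


def unlearn_linear_layer(W, acts_retain, acts_forget, energy=0.99):
    """Project W (out x in) so inputs in the discriminatory space map to zero."""
    U_r = activation_basis(acts_retain, energy)  # Retain Space estimate
    U_f = activation_basis(acts_forget, energy)  # Forget Space estimate
    Q = discriminatory_basis(U_r, U_f)
    return W - (W @ Q) @ Q.T  # equals W (I - Q Q^T)


# Toy example: retain inputs span e0..e2, forget inputs span e2 and e5,
# so e2 is shared and e5 is the only class-specific direction.
eye = np.eye(8)
acts_retain = eye[:, [0, 1, 2]]
acts_forget = eye[:, [2, 5]]
W = np.random.default_rng(0).normal(size=(4, 8))
W_new = unlearn_linear_layer(W, acts_retain, acts_forget)
```

In this toy setting the projected layer leaves its response to the retain inputs unchanged, including the shared direction, and maps the forget-only direction to zero. With real activations the two spaces overlap only approximately, so the truncation threshold and tolerance would need tuning per layer.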