Since the advent of data protection regulations (e.g., the General Data Protection Regulation), there has been growing demand for deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks to adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion while retaining predictive performance on the remaining data. To this end, we define instance-wise unlearning, whose goal is to delete information on a set of instances from a pre-trained model, either by misclassifying each instance away from its original prediction or by relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation level and 2) leveraging weight importance metrics to pinpoint the network parameters responsible for propagating unwanted information. Both methods require only the pre-trained model and the data instances to be forgotten, allowing straightforward application to real-life settings where the entire training set is unavailable. Through extensive experiments on various image classification benchmarks, we show that our approach effectively preserves knowledge of the remaining data while unlearning the given instances in both single-task and continual unlearning scenarios.
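The two unlearning goals named above (misclassifying an instance away from its original prediction, or relabeling it to a different label) can be illustrated with a minimal sketch. It assumes a toy bias-free linear softmax classifier in plain Python rather than the deep image classifiers studied here, and all function names (`predict`, `unlearn_step`, `unlearn_instance`) and hyperparameters are hypothetical. It shows only the per-instance objective, taken here to be gradient ascent or descent on cross-entropy; the two proposed methods for reducing forgetting on the remaining data are not included.

```python
import math


def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, x):
    """Class probabilities of a toy linear softmax model; W[k] is the weight row of class k."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return softmax(logits)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def unlearn_step(W, x, label, lr, mode):
    """Return new weights after one gradient step on a single instance.

    mode="misclassify": ascend the cross-entropy of the ORIGINAL label.
    mode="relabel":     descend the cross-entropy of the NEW target label.
    """
    p = predict(W, x)
    sign = 1.0 if mode == "misclassify" else -1.0
    # d CE / d W[k][j] = (p[k] - 1[k == label]) * x[j]
    return [
        [w + sign * lr * (p[k] - (1.0 if k == label else 0.0)) * xj
         for w, xj in zip(row, x)]
        for k, row in enumerate(W)
    ]


def unlearn_instance(W, x, label, mode="misclassify", lr=0.5, max_steps=100):
    """Repeat unlearning steps until the instance-wise goal is met (assumed stopping rule)."""
    for _ in range(max_steps):
        pred = argmax(predict(W, x))
        done = (pred != label) if mode == "misclassify" else (pred == label)
        if done:
            break
        W = unlearn_step(W, x, label, lr, mode)
    return W
```

For example, with a three-class weight matrix `W` and an instance `x` currently predicted as class 0, `unlearn_instance(W, x, 0, mode="misclassify")` returns weights under which `x` is no longer assigned to class 0, while `unlearn_instance(W, x, 2, mode="relabel")` returns weights under which `x` is assigned to class 2. The input weights are left unmodified in both cases.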