Adversarial attacks by malicious actors on machine learning systems, such as the introduction of poison triggers into training datasets, pose significant risks. In practice, the challenge in resolving such an attack is that often only a subset of the poisoned data can be identified. This necessitates methods that remove, i.e., unlearn, poison triggers from already trained models with only a subset of the poison data available. The requirements for this task differ significantly from those of privacy-focused unlearning, where all of the data to be forgotten by the model is known. Previous work has shown that undiscovered poisoned samples cause established unlearning methods to fail, with only one method, Selective Synaptic Dampening (SSD), showing limited success. Even full retraining after removal of the identified poison cannot address this challenge, as the undiscovered poison samples reintroduce the poison trigger into the model. Our work addresses two key challenges to advance the state of the art in poison unlearning. First, we introduce a novel outlier-resistant method, based on SSD, that significantly improves model protection and unlearning performance. Second, we introduce Poison Trigger Neutralisation (PTN) search, a fast, parallelisable hyperparameter search that exploits the characteristic "unlearning versus model protection" trade-off to find suitable hyperparameters in settings where the forget set size is unknown and the retain set is contaminated. We benchmark our contributions using ResNet-9 on CIFAR10 and WideResNet-28x10 on CIFAR100. Experimental results show that our method heals 93.72% of poison, compared to 83.41% for SSD and 40.68% for full retraining. We achieve this while also lowering the average model accuracy drop caused by unlearning from 5.68% (SSD) to 1.41% (ours).
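The abstract only names the trade-off that PTN search exploits, so a brief sketch may help. The following is a minimal Python illustration of one plausible reading, not the paper's actual implementation: candidate dampening strengths are evaluated independently (hence parallelisable), and the search returns the mildest strength whose accuracy on the identified poison subset falls below a threshold, i.e. just enough unlearning to neutralise the trigger. All names here (`ptn_search`, `unlearn`, `forget_accuracy`, `alphas`, `forget_threshold`) are hypothetical placeholders, not the paper's interface.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple, TypeVar

Model = TypeVar("Model")


def ptn_search(
    unlearn: Callable[[float], Model],
    forget_accuracy: Callable[[Model], float],
    alphas: Sequence[float],
    forget_threshold: float = 0.05,
) -> Tuple[float, Model]:
    """Return the mildest dampening strength that neutralises the trigger.

    `unlearn(alpha)` applies an SSD-style unlearning step to a copy of the
    trained model with dampening strength `alpha`; `forget_accuracy`
    evaluates the result on the identified poison subset. Both callables
    and the threshold are assumptions made for this sketch.
    """
    # Candidate strengths are independent of one another, so they can be
    # evaluated concurrently -- the property the abstract calls
    # "parallelisable".
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(unlearn, alphas))

    # Walk the trade-off curve from mildest to strongest dampening and stop
    # at the first candidate whose accuracy on the known poison drops below
    # the threshold: enough unlearning, minimal damage to the model.
    for alpha, model in sorted(zip(alphas, models), key=lambda p: p[0]):
        if forget_accuracy(model) <= forget_threshold:
            return alpha, model

    # If no candidate neutralises the trigger, fall back to the strongest.
    return max(zip(alphas, models), key=lambda p: p[0])
```

Stopping at the mildest passing strength is what operationalises the "unlearning versus model protection" trade-off in this sketch: stronger dampening removes more of the trigger but erodes overall accuracy, and the selection rule uses only the discovered poison subset, consistent with the setting where the forget set size is unknown and the retain set cannot be trusted.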