We propose a novel clustering mechanism based on an incompatibility property between subsets of data that emerges during model training. This mechanism partitions the dataset into subsets that generalize only to themselves, i.e., training on one subset does not improve performance on the other subsets. By leveraging the interaction between the dataset and the training process, our clustering mechanism partitions datasets into clusters that are defined by, and therefore meaningful to, the objective of the training process. We apply our clustering mechanism to defend against data poisoning attacks, in which the attacker injects poisoned data into the training dataset to affect the trained model's output. Our evaluation focuses on backdoor attacks against deep neural networks trained to perform image classification on the GTSRB and CIFAR-10 datasets. Our results show that (1) these attacks produce poisoned datasets in which the poisoned and clean data are incompatible and (2) our technique successfully identifies and removes the poisoned data. In an end-to-end evaluation, our defense reduces the attack success rate to below 1% in 134 of 165 scenarios, with only a 2% drop in clean accuracy on CIFAR-10 and a negligible drop in clean accuracy on GTSRB.
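To make the incompatibility property concrete, the following is a minimal toy sketch, not the method evaluated in this work. It assumes a nearest-centroid classifier in place of a deep neural network, a hand-built two-feature dataset in which the second feature plays the role of a backdoor trigger, and a fixed chance-level accuracy of 0.5 as the baseline for "does not improve performance". All function names (`fit_centroids`, `improves`, `incompatible`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def fit_centroids(data):
    """Return {label: centroid} for a list of (features, label) pairs."""
    groups = defaultdict(list)
    for features, label in data:
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }


def predict(centroids, features):
    """Predict the label whose centroid is closest in squared distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, data):
    """Fraction of (features, label) pairs classified correctly."""
    correct = sum(predict(centroids, f) == y for f, y in data)
    return correct / len(data)


def improves(train, test, chance=0.5):
    """Assumption: 'improves' means beating chance-level accuracy on test."""
    return accuracy(fit_centroids(train), test) > chance


def incompatible(a, b, chance=0.5):
    """Two subsets are incompatible if neither one helps on the other."""
    return not improves(a, b, chance) and not improves(b, a, chance)


# Clean data: label is 1 when x > 0; the trigger feature is 0.
clean = [((x, 0.0), int(x > 0)) for x in (-3.0, -2.0, -1.0, 1.0, 2.0, 3.0)]
# Poisoned data: trigger feature set to 1, label forced to the target class 1.
poison = [((x, 1.0), 1) for x in (-3.0, -2.0, -1.0)]
# Two interleaved halves of the clean data, for comparison.
clean_a, clean_b = clean[0::2], clean[1::2]

print(incompatible(clean, poison))     # True
print(incompatible(clean_a, clean_b))  # False
```

In this toy setting, a model trained on the clean subset misclassifies every poisoned point, and a model trained only on the poisoned subset does no better than chance on clean data, so the two subsets are flagged as incompatible. The two clean halves generalize to each other and are not.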