A backdoor, or Trojan, attack is an important type of data poisoning attack against deep neural network (DNN) classifiers. The training dataset is poisoned with a small number of samples that each contain the backdoor pattern (usually a pattern that is either imperceptible or innocuous) and that are mislabeled to the attacker's target class. When trained on a backdoor-poisoned dataset, a DNN behaves normally on most benign test samples but misclassifies a test sample to the target class when the backdoor pattern has been incorporated into it (i.e., when it contains a backdoor trigger). Here we focus on image classification tasks and show that supervised training may build a stronger association between the backdoor pattern and the target class than between normal features and the true class of origin. By contrast, self-supervised representation learning ignores the labels of samples and learns a feature embedding based on the images' semantic content. We thus propose to use unsupervised representation learning to avoid emphasizing backdoor-poisoned training samples and to learn similar feature embeddings for samples of the same class. Using a feature embedding found by self-supervised representation learning, we develop a data cleansing method that combines sample filtering and re-labeling. Experiments on CIFAR-10 benchmark datasets show that our method achieves state-of-the-art performance in mitigating backdoor attacks.
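The abstract does not state the exact filtering or re-labeling rule, so the following is only an illustrative sketch of how such a cleansing step could look, not the authors' algorithm. It assumes the embeddings have already been produced by some self-supervised encoder and substitutes a simple k-nearest-neighbor label vote for the actual method. The function name `knn_cleanse` and the parameters `k` and `relabel_threshold` are hypothetical.

```python
import math
from collections import Counter


def knn_cleanse(embeddings, labels, k=3, relabel_threshold=0.8):
    """Illustrative k-NN cleansing (an assumption, not the paper's rule).

    For each sample, look at the labels of its k nearest neighbors in
    embedding space:
      * neighbors' majority label equals the sample's label -> keep as-is
      * otherwise, if the majority is strong enough          -> re-label
      * otherwise                                            -> filter out
    Returns (kept_indices, cleaned_labels).
    """
    n = len(embeddings)
    kept, cleaned = [], []
    for i in range(n):
        # Euclidean distance from sample i to every other sample.
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j)
            for j in range(n) if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        if not neighbor_labels:  # single-sample dataset: nothing to compare
            kept.append(i)
            cleaned.append(labels[i])
            continue
        majority, count = Counter(neighbor_labels).most_common(1)[0]
        agreement = count / len(neighbor_labels)
        if majority == labels[i]:
            kept.append(i)
            cleaned.append(labels[i])
        elif agreement >= relabel_threshold:
            kept.append(i)
            cleaned.append(majority)
        # else: neighbors disagree with the label and with each other; drop.
    return kept, cleaned


# Toy data: two well-separated clusters; sample 4 lies in the class-0
# cluster but carries the (poisoned) label 1.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.05, 5.05),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

kept, cleaned = knn_cleanse(embeddings, labels)
print(kept, cleaned)
```

The sketch relies on the premise stated in the abstract: if the embedding groups samples by semantic content rather than by label, a poisoned sample sits among samples of its true class, so its neighbors' labels can expose and correct the mislabeling.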