Achieving resiliency against adversarial attacks is necessary before deploying neural network classifiers in domains where misclassification incurs substantial costs, e.g., self-driving cars or medical imaging. Recent work has demonstrated that robustness can be transferred from an adversarially trained teacher to a student model via knowledge distillation. However, current methods perform distillation using a single adversarially trained teacher and a single vanilla teacher, and consider homogeneous architectures (i.e., residual networks) that are prone to misclassifying examples from similar adversarial subspaces. In this work, we develop a defense framework against adversarial attacks by distilling adversarial robustness using heterogeneous teachers (DARHT). In DARHT, the student model explicitly represents teacher logits in a student-teacher feature map and leverages multiple teachers that exhibit low adversarial example transferability (i.e., exhibit high performance on dissimilar adversarial examples). Experiments on classification tasks in both white-box and black-box scenarios demonstrate that DARHT achieves state-of-the-art clean and robust accuracies compared to competing adversarial training and distillation methods on the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets. Comparisons with homogeneous and heterogeneous teacher sets suggest that leveraging teachers with low adversarial example transferability increases student model robustness.
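As a minimal sketch of the multi-teacher distillation idea underlying this framework, the example below combines a standard cross-entropy term with the mean temperature-scaled KL divergence to several teachers' logits. This is an illustrative assumption of how multiple teacher outputs might be aggregated; the function name, the equal weighting of teachers, and the loss form are hypothetical and do not reproduce DARHT's student-teacher feature map.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          T=4.0, alpha=0.5):
    """Hypothetical multi-teacher distillation loss: cross-entropy on the
    true labels plus the average KL divergence from each teacher's softened
    distribution to the student's (teachers weighted equally for simplicity)."""
    # Hard-label term: standard cross-entropy at temperature 1.
    p_hard = softmax(student_logits)
    ce = -np.log(p_hard[np.arange(len(labels)), labels]).mean()

    # Soft-label term: KL(teacher || student) at temperature T, averaged
    # over teachers; the T^2 factor keeps gradient scale comparable.
    log_p_student = np.log(softmax(student_logits, T))
    kd = 0.0
    for t_logits in teacher_logits_list:
        p_t = softmax(t_logits, T)
        kd += (p_t * (np.log(p_t) - log_p_student)).sum(axis=-1).mean() * T * T
    kd /= len(teacher_logits_list)

    return alpha * ce + (1.0 - alpha) * kd
```

When every teacher's logits coincide with the student's, the KL term vanishes and the loss reduces to `alpha` times the cross-entropy, which makes the two components easy to check in isolation.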