Adversarial attacks on image classification networks generated by the AutoAttack framework (Croce and Hein, 2020b) have recently drawn considerable attention. Although AutoAttack achieves a very high attack success rate, most defense approaches focus on network hardening and robustness enhancements, such as adversarial training. With this strategy, the best method currently reported withstands about 66% of adversarial examples on CIFAR10. In this paper, we investigate the spatial and frequency domain properties of AutoAttack and propose an alternative defense. Instead of hardening a network, we detect adversarial attacks during inference and reject manipulated inputs. Based on a simple and fast analysis in the frequency domain, we introduce two detection algorithms. The first is a black-box detector that operates only on the input images and achieves a detection accuracy of 100% on the AutoAttack CIFAR10 benchmark and 99.3% on ImageNet, for epsilon = 8/255 in both cases. The second is a white-box detector that analyzes CNN feature maps and achieves detection rates of 100% and 98.7% on the same benchmarks.
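To make the idea of a frequency-domain check on input images more concrete, the following is a minimal sketch and not the detector evaluated in the paper. It assumes that a perturbation leaves extra energy in the high-frequency part of the 2D Fourier spectrum. The function names, the `radius_fraction` parameter, the synthetic test image, and the use of random sign noise as a stand-in for an AutoAttack perturbation are all illustrative assumptions; a real detector would more plausibly train a classifier on such spectra than compare a single score.

```python
import numpy as np


def log_magnitude_spectrum(image: np.ndarray) -> np.ndarray:
    """Centered log-magnitude Fourier spectrum of an H x W x C image."""
    spectrum = np.fft.fft2(image, axes=(0, 1))          # per-channel 2D FFT
    spectrum = np.fft.fftshift(spectrum, axes=(0, 1))   # move zero frequency to the center
    return np.log1p(np.abs(spectrum))


def high_frequency_energy(image: np.ndarray, radius_fraction: float = 0.25) -> float:
    """Mean log-magnitude outside a centered low-frequency disc.

    radius_fraction is a hypothetical tuning knob, not a value from the paper.
    """
    height, width = image.shape[:2]
    magnitude = log_magnitude_spectrum(image)
    rows, cols = np.ogrid[:height, :width]
    distance = np.hypot(rows - height // 2, cols - width // 2)
    high_freq_mask = distance > radius_fraction * min(height, width)
    return float(magnitude[high_freq_mask].mean())


# Synthetic 32 x 32 x 3 image that contains only low spatial frequencies.
rng = np.random.default_rng(0)
y, x = np.mgrid[0:32, 0:32] / 32.0
clean = 0.5 + 0.25 * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)
clean = np.repeat(clean[..., None], 3, axis=2)

# Random sign noise of size epsilon = 8/255: a stand-in, NOT an AutoAttack example.
epsilon = 8 / 255
noise = epsilon * rng.choice([-1.0, 1.0], size=clean.shape)
perturbed = np.clip(clean + noise, 0.0, 1.0)

clean_score = high_frequency_energy(clean)
perturbed_score = high_frequency_energy(perturbed)
print(f"clean: {clean_score:.4f}  perturbed: {perturbed_score:.4f}")
```

On this toy input the perturbed image scores higher than the clean one because the clean image has essentially no high-frequency content. Natural images are far less clean-cut, which is one reason a learned decision rule over the full spectrum would be a more realistic choice than a fixed threshold.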