Adversarial attacks hamper the decision-making ability of neural networks by perturbing the input signal. Adding a small, carefully computed distortion to an image, for instance, can deceive a well-trained image classification network. In this work, we propose a novel attack technique called the Sparse Adversarial and Interpretable Attack Framework (SAIF). Specifically, we design imperceptible attacks that apply low-magnitude perturbations to a small number of pixels, and we leverage these sparse attacks to reveal the vulnerability of classifiers. We use the Frank-Wolfe (conditional gradient) algorithm to optimize the attack perturbations simultaneously for bounded magnitude and sparsity, with an $O(1/\sqrt{T})$ convergence rate. Empirical results show that SAIF computes highly imperceptible and interpretable adversarial examples and outperforms state-of-the-art sparse attack methods on the ImageNet dataset.
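To make the Frank-Wolfe idea above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the perturbation is written as an elementwise product `s * p`, where `p` lies in an $\ell_\infty$ ball of radius `eps` (bounded magnitude) and `s` lies in the polytope $\{s \in [0,1]^n : \sum_i s_i \le k\}$ (sparsity). That decomposition, the step size $1/\sqrt{t+2}$, the helper names, and the toy linear loss are all illustrative assumptions; the abstract does not specify them. In a real attack, `grad_fn` would return gradients of the classifier's loss obtained by automatic differentiation.

```python
import numpy as np

def lmo_linf(grad, eps):
    """Linear minimization oracle over the l_inf ball of radius eps."""
    return -eps * np.sign(grad)

def lmo_k_sparse(grad, k):
    """LMO over {s in [0,1]^n : sum(s) <= k}.

    Puts weight 1 on (at most) the k coordinates with the most
    negative gradient and 0 everywhere else.
    """
    v = np.zeros_like(grad)
    idx = np.argsort(grad)[:k]      # k smallest gradient entries
    idx = idx[grad[idx] < 0]        # keep only those that lower the objective
    v[idx] = 1.0
    return v

def frank_wolfe_sparse_attack(grad_fn, n, eps, k, steps=50):
    """Jointly update magnitude p and mask s with Frank-Wolfe steps.

    grad_fn(p, s) must return (dL/dp, dL/ds) for the loss being minimized.
    """
    p = np.zeros(n)                 # feasible: inside the l_inf ball
    s = np.full(n, k / n)           # feasible: entries in [0,1], sum equals k
    for t in range(steps):
        g_p, g_s = grad_fn(p, s)
        z = lmo_linf(g_p, eps)      # best vertex for the magnitude constraint
        v = lmo_k_sparse(g_s, k)    # best vertex for the sparsity constraint
        gamma = 1.0 / np.sqrt(t + 2)  # illustrative diminishing step size
        p = p + gamma * (z - p)     # convex combination keeps p feasible
        s = s + gamma * (v - s)     # convex combination keeps s feasible
    return p, s

# Toy stand-in for a classifier: a linear score w @ (x + s * p).
# Minimizing w @ (s * p) lowers the score; w is a hypothetical weight vector.
w = np.array([3.0, -2.0, 0.5, -0.1, 1.0, -4.0])

def toy_loss(p, s):
    return float(w @ (s * p))

def toy_grad(p, s):
    return w * s, w * p             # (dL/dp, dL/ds)

p, s = frank_wolfe_sparse_attack(toy_grad, n=w.size, eps=0.1, k=2)
print("mask:", np.round(s, 3))
print("loss:", toy_loss(p, s))
```

Because every iterate is a convex combination of feasible points, no projection step is needed: `p` never exceeds `eps` in absolute value and `s` never carries more than `k` units of total weight. In this toy problem the mask concentrates on the two coordinates with the largest `|w|`, which illustrates how a sparse mask can point to the input locations that matter most to the model.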