Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks, which cause DNNs to misclassify inputs while remaining potentially imperceptible to humans. Adversarial defense is therefore an important way to improve the robustness of DNNs. Existing attack methods often construct adversarial examples by perturbing samples under a metric such as the $\ell_p$ distance. However, these metrics permit only limited perturbations and can thus be insufficient for conducting adversarial attacks. In this paper, we propose a new internal Wasserstein distance (IWD) that captures the semantic similarity of two samples and thereby allows larger perturbations than currently used metrics such as the $\ell_p$ distance. We then apply IWD to both adversarial attack and defense. In particular, we develop a novel attack method that uses IWD to measure the similarity between an image and its adversarial examples. In this way, we can generate diverse and semantically similar adversarial examples that are more difficult for existing defense methods to defend against. Moreover, we devise a new defense method that relies on IWD to learn models that are robust to unseen adversarial examples. We provide both theoretical and empirical evidence to support our methods.
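To make the contrast between the $\ell_p$ distance and a distribution-based similarity concrete, the following is a minimal sketch and not the method proposed in this paper. The abstract does not define IWD, so the sketch assumes that a semantic distance compares the distributions of small internal patches of two images, and it substitutes a sliced 1-Wasserstein distance (an average of one-dimensional Wasserstein distances over random projections) for a full Wasserstein distance. The function names, the patch size `k`, the number of projections `n_proj`, and the toy image are all hypothetical choices made for illustration.

```python
import math
import random


def extract_patches(img, k):
    """Return all k x k patches of a 2-D image (list of rows), flattened.

    Patches wrap around the image border (an illustrative assumption),
    so every pixel is the top-left corner of exactly one patch.
    """
    h, w = len(img), len(img[0])
    return [
        [img[(r + dr) % h][(c + dc) % w] for dr in range(k) for dc in range(k)]
        for r in range(h)
        for c in range(w)
    ]


def l2_distance(a, b):
    """Pixel-wise Euclidean (l_2) distance between two equally sized images."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def sliced_patch_wasserstein(a, b, k=3, n_proj=32, seed=0):
    """Sliced 1-Wasserstein distance between the patch sets of two images.

    Assumes both images have the same size, so both patch sets have the
    same number of elements. This is a stand-in, not the paper's IWD.
    """
    rng = random.Random(seed)
    pa, pb = extract_patches(a, k), extract_patches(b, k)
    total = 0.0
    for _ in range(n_proj):
        # Draw a random unit direction in patch space.
        d = [rng.gauss(0.0, 1.0) for _ in range(k * k)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # Project every patch onto the direction and sort the projections.
        proj_a = sorted(sum(p * v for p, v in zip(patch, d)) for patch in pa)
        proj_b = sorted(sum(p * v for p, v in zip(patch, d)) for patch in pb)
        # For equal-size 1-D samples, W1 is the mean gap between sorted values.
        total += sum(abs(x - y) for x, y in zip(proj_a, proj_b)) / len(proj_a)
    return total / n_proj


# Toy 8 x 8 image: a bright 3 x 3 square on a dark background.
img = [[1.0 if 2 <= r < 5 and 2 <= c < 5 else 0.0 for c in range(8)]
       for r in range(8)]
# The same image circularly shifted two columns to the right.
shifted = [row[-2:] + row[:-2] for row in img]
# An all-dark image, which no longer contains the square at all.
blank = [[0.0] * 8 for _ in range(8)]

print("l2(img, shifted)    =", l2_distance(img, shifted))
print("patch(img, shifted) =", sliced_patch_wasserstein(img, shifted))
print("patch(img, blank)   =", sliced_patch_wasserstein(img, blank))
```

In this toy setting the shifted image differs from the original in 12 pixels, so its $\ell_2$ distance is $\sqrt{12} \approx 3.46$, whereas the patch-based distance is zero because a circular shift leaves the set of wrap-around patches unchanged. Removing the square entirely gives a strictly positive patch-based distance. This illustrates, under the stated assumptions, how a distribution-based metric can tolerate perturbations that are large in $\ell_p$ terms yet preserve image content.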