Many existing adversarial attacks generate $L_p$-norm perturbations in the RGB space of images. Despite some progress in transferability and attack success rate, the resulting adversarial examples are easily perceived by the human eye. Toward visual imperceptibility, some recent works explore unrestricted attacks without $L_p$-norm constraints, yet these attacks transfer poorly to black-box models. In this work, we propose a novel imperceptible and transferable attack that leverages both the generative and discriminative power of diffusion models. Specifically, instead of manipulating pixels directly, we craft perturbations in the latent space of diffusion models. Combined with well-designed content-preserving structures, this yields human-insensitive perturbations embedded with semantic clues. For better transferability, we further "deceive" the diffusion model, which can be viewed as an additional recognition surrogate, by distracting its attention away from the target regions. To our knowledge, our proposed method, DiffAttack, is the first to introduce diffusion models into the field of adversarial attacks. Extensive experiments on various model architectures (including CNNs, Transformers, and MLPs) and defense methods demonstrate the superiority of our method over other attack methods.