Denoising probabilistic diffusion models have shown breakthrough performance, generating more photo-realistic images and human-level illustrations than prior models such as GANs. This image-generation capability has stimulated many downstream applications in various areas. However, we find that this technology is a double-edged sword. We identify a new type of attack, the Natural Denoising Diffusion (NDD) attack, based on the finding that state-of-the-art deep neural network (DNN) models still hold their predictions even when we use text prompts to intentionally remove the robust features that are essential to the human visual system (HVS). The NDD attack can generate low-cost, model-agnostic, and transferable adversarial attacks by exploiting the natural attack capability of diffusion models. Motivated by this finding, we construct a large-scale dataset, the Natural Denoising Diffusion Attack (NDDA) dataset, to systematically evaluate the risk posed by the natural attack capability of state-of-the-art text-to-image diffusion models. We evaluate the natural attack capability by answering 6 research questions. Through a user study to confirm the validity of the NDD attack, we find that it can achieve an 88% detection rate while being stealthy to 93% of human subjects. We also find that the non-robust features embedded by diffusion models contribute to the natural attack capability. To confirm the model-agnostic and transferable attack capability, we perform the NDD attack against an autonomous driving (AD) vehicle and find that 73% of the physically printed attacks are detected as stop signs. We hope that our study and dataset help the community become aware of the risks of diffusion models and facilitate further research toward robust DNN models.