Denoising probabilistic diffusion models have shown breakthrough performance in generating more photo-realistic images and human-level illustrations than prior models such as GANs. This image-generation capability has stimulated many downstream applications in various areas. However, we find that this technology is a double-edged sword: we identify a new type of attack, called the Natural Denoising Diffusion (NDD) attack, based on the finding that state-of-the-art deep neural network (DNN) models still hold their predictions even when we use text prompts to intentionally remove the robust features that are essential to the human visual system (HVS). By exploiting this natural attack capability in diffusion models, the NDD attack can generate low-cost, model-agnostic, and transferable adversarial attacks. To systematically evaluate the risk of the NDD attack, we perform a large-scale empirical study with our newly created dataset, the Natural Denoising Diffusion Attack (NDDA) dataset. We evaluate the natural attack capability by answering 6 research questions. Through a user study, we find that the NDD attack can achieve an 88% detection rate while being stealthy to 93% of human subjects; we also find that the non-robust features embedded by diffusion models contribute to the natural attack capability. To confirm the model-agnostic and transferable attack capability, we perform the NDD attack against the Tesla Model 3 and find that 73% of the physically printed attacks are detected as stop signs. We hope that this study and dataset help our community become aware of the risks in diffusion models and facilitate further research toward robust DNN models.