In the era of generative AI, diffusion models have emerged as a powerful and widely adopted tool for content generation and editing across various data modalities, which makes studying their potential security risks critical. Recent pioneering works have shown that diffusion models are vulnerable to backdoor attacks, calling for in-depth analysis of the security challenges of this popular and fundamental AI technique. In this paper, for the first time, we systematically explore the detectability of the poisoned noise input for backdoored diffusion models, an important performance metric that has been little explored in existing works. Starting from the perspective of a defender, we first analyze the properties of the trigger patterns used in existing diffusion backdoor attacks and discover the important role of distribution discrepancy in Trojan detection. Based on this finding, we propose a low-cost trigger detection mechanism that can effectively identify poisoned input noise. We then study the same problem from the attacker's side, proposing a backdoor attack strategy that learns an unnoticeable trigger to evade our proposed detection scheme. Empirical evaluations across various diffusion models and datasets demonstrate the effectiveness of both the proposed trigger detection and the detection-evading attack strategy. For trigger detection, our distribution discrepancy-based solution achieves a 100\% detection rate for the Trojan triggers used in existing works. For evading trigger detection, our stealthy trigger design approach performs end-to-end learning to make the distribution of the poisoned noise input approach that of benign noise, enabling a nearly 100\% detection pass rate while maintaining high attack and benign performance for the backdoored diffusion models.
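To make the idea of distribution discrepancy-based detection concrete, the following is a minimal sketch and not the detection mechanism proposed in the paper. It assumes that benign input noise is drawn from a standard Gaussian and that a poisoned input is a shifted and rescaled version of it. The function names `discrepancy_score` and `is_poisoned`, the moment-based statistic, the threshold of 6, and the trigger parameters (offset 0.5, scale 0.8) are all hypothetical choices made for illustration.

```python
import numpy as np


def discrepancy_score(noise: np.ndarray) -> float:
    """Score how far the entries of `noise` deviate from N(0, 1).

    Assumption: benign noise has i.i.d. standard normal entries, so the
    sample mean is approximately N(0, 1/d) and the sample variance is
    approximately N(1, 2/d) for d entries. The score is the larger of
    the two absolute z-scores.
    """
    x = np.asarray(noise, dtype=np.float64).ravel()
    d = x.size
    z_mean = x.mean() * np.sqrt(d)
    z_var = (x.var() - 1.0) / np.sqrt(2.0 / d)
    return float(max(abs(z_mean), abs(z_var)))


def is_poisoned(noise: np.ndarray, threshold: float = 6.0) -> bool:
    """Flag the input when its score exceeds a (hypothetical) threshold."""
    return discrepancy_score(noise) > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (3, 32, 32)  # illustrative image-shaped noise tensor

    benign = rng.standard_normal(shape)
    # Hypothetical trigger: constant offset plus rescaled Gaussian noise.
    poisoned = 0.5 + 0.8 * rng.standard_normal(shape)

    print("benign flagged:  ", is_poisoned(benign))
    print("poisoned flagged:", is_poisoned(poisoned))
```

Under these assumptions, a detection-evading trigger would be one for which the score of the poisoned input stays below the threshold, which corresponds to making the poisoned noise distribution approach the benign one.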