Diffusion-based purification (DBP) has become a cornerstone defense against adversarial examples (AEs), regarded as robust due to its use of diffusion models (DMs) that project AEs onto the natural data manifold. We refute this core claim, theoretically proving that gradient-based attacks effectively target the DM rather than the classifier, causing DBP's outputs to align with adversarial distributions. This finding prompts a reassessment of DBP's reported robustness, which we attribute to two critical flaws: incorrect gradients and inappropriate evaluation protocols that test only a single random purification of the AE. We show that with proper accounting for stochasticity and resubmission risk, DBP collapses. To support this, we introduce DiffBreak, the first reliable toolkit for differentiation through DBP, eliminating gradient flaws that previously further inflated robustness estimates. We also analyze the current defense scheme used for DBP, in which classification relies on a single purification, and pinpoint its inherent invalidity. We provide a statistically grounded majority-vote (MV) alternative that aggregates predictions across multiple purified copies, showing partial but meaningful robustness gains. We then propose a novel adaptation of an optimization method against deepfake watermarking, crafting systemic perturbations that defeat DBP even under MV, challenging DBP's viability.
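The majority-vote scheme described above can be sketched as follows. This is a minimal illustration only: `purify` and `classify` are hypothetical stand-ins (not part of DiffBreak or any specific DBP implementation) for a stochastic purification pass and a downstream classifier, and the toy stubs exist purely to make the sketch runnable.

```python
from collections import Counter
import random

def majority_vote(classify, purify, x, n_copies=10):
    """Aggregate predictions over several independent stochastic
    purifications of the same input and return the plurality label
    together with the fraction of copies that agreed with it."""
    votes = Counter(classify(purify(x)) for _ in range(n_copies))
    label, count = votes.most_common(1)[0]
    return label, count / n_copies

# Toy stand-ins so the sketch runs end-to-end (purely illustrative):
random.seed(0)

def toy_purify(x):
    # Stochastic "purification": each call perturbs the input differently,
    # mimicking the randomness of a diffusion re-noise/denoise pass.
    return x + random.gauss(0, 0.1)

def toy_classify(x):
    # Trivial threshold classifier standing in for the real model.
    return int(x > 0.5)

label, agreement = majority_vote(toy_classify, toy_purify, x=0.6, n_copies=25)
```

Because each purified copy is an independent draw, a single adversarial submission that happens to survive one purification need not survive the plurality of them, which is the statistical intuition the abstract invokes.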