Neural networks are known to be susceptible to adversarial samples: small variations of natural examples crafted to deliberately mislead the models. Although such samples can be easily generated using gradient-based techniques in both digital and physical scenarios, they often differ greatly from the actual data distribution of natural images, resulting in a trade-off between strength and stealthiness. In this paper, we propose a novel framework dubbed Diffusion-Based Projected Gradient Descent (Diff-PGD) for generating realistic adversarial samples. By exploiting a gradient guided by a diffusion model, Diff-PGD ensures that adversarial samples remain close to the original data distribution while maintaining their effectiveness. Moreover, our framework can be easily customized for specific tasks such as digital attacks, physical-world attacks, and style-based attacks. Compared with existing methods for generating natural-style adversarial samples, our framework separates the optimization of the adversarial loss from that of other surrogate losses (e.g., content/smoothness/style loss), making it more stable and controllable. Finally, we demonstrate that the samples generated using Diff-PGD have better transferability and anti-purification power than those produced by traditional gradient-based methods. Code will be released at https://github.com/xavihart/Diff-PGD
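To make the idea of a gradient that passes through a purifier concrete, the following is a minimal toy sketch, not the authors' implementation. Every name in it is hypothetical: a 1-D signal stands in for an image, a moving-average filter (`purify`) stands in for the diffusion-based purification step, and a fixed linear scorer stands in for the target classifier. It only illustrates a PGD loop whose adversarial loss is evaluated on the purified input while the perturbation stays inside an L-infinity budget.

```python
# Toy sketch (assumption-laden): PGD with the loss evaluated on a purified input.
# The moving-average "purifier" is a stand-in for a diffusion model, chosen
# because it is linear and its gradient can be written by hand.

def purify(x):
    """Stand-in purifier: 3-tap moving average with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]

def purify_transpose(g):
    """Transpose of the linear map in purify(); backpropagates a gradient."""
    n = len(g)
    out = [0.0] * n
    for i in range(n):
        for j in (max(i - 1, 0), i, min(i + 1, n - 1)):
            out[j] += g[i] / 3.0
    return out

def score(w, x):
    """Hypothetical linear classifier score; higher means 'correct class'."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return (v > 0) - (v < 0)

def purified_pgd(x0, w, eps=0.1, alpha=0.02, steps=20):
    """Lower score(w, purify(x)) while keeping |x - x0| <= eps elementwise."""
    x = list(x0)
    for _ in range(steps):
        # Adversarial loss: -score(w, purify(x)). Its gradient w.r.t. x
        # flows back through the purifier: -purify_transpose(w).
        grad = [-g for g in purify_transpose(w)]
        # Signed gradient ascent step on the adversarial loss.
        x = [xi + alpha * sign(gi) for xi, gi in zip(x, grad)]
        # Project onto the L-infinity ball around x0, then onto [0, 1].
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x

if __name__ == "__main__":
    x0 = [0.5] * 8
    w = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
    x_adv = purified_pgd(x0, w)
    print("clean score on purified input:", score(w, purify(x0)))
    print("adversarial score on purified input:", score(w, purify(x_adv)))
```

Because the toy purifier is linear, the gradient here is constant across iterations; with a real diffusion model the gradient would have to be recomputed by backpropagating through the sampling steps at every iteration.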