Neural networks are known to be susceptible to adversarial samples: small perturbations of natural examples crafted to deliberately mislead the models. Although such samples are easy to generate with gradient-based techniques in both digital and physical scenarios, they often deviate substantially from the distribution of natural images, which creates a trade-off between attack strength and stealthiness. In this paper, we propose a novel framework, Diffusion-Based Projected Gradient Descent (Diff-PGD), for generating realistic adversarial samples. By using gradients guided by a diffusion model, Diff-PGD keeps adversarial samples close to the original data distribution while maintaining their effectiveness. Moreover, our framework can be easily customized for specific tasks such as digital attacks, physical-world attacks, and style-based attacks. Unlike existing methods for generating natural-style adversarial samples, our framework separates the optimization of the adversarial loss from that of other surrogate losses (e.g., content, smoothness, or style losses), making it more stable and controllable. Finally, we demonstrate that samples generated with Diff-PGD have better transferability and anti-purification power than those produced by traditional gradient-based methods. Code will be released at https://github.com/xavihart/Diff-PGD
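To make the idea of a purifier-guided gradient concrete, the following is a minimal sketch and not the released Diff-PGD implementation. It runs a standard L-infinity PGD loop in which the loss is evaluated on a purified copy of the current sample and the gradient is propagated back through the purifier. Everything here is an assumption made for illustration: `purify` is a hypothetical stand-in (a circular 3-tap mean) for a diffusion model, the classifier is a toy linear model with weights `w`, and the names `eps`, `alpha`, and `steps` are ordinary PGD hyperparameters, not values from the paper.

```python
def purify(x):
    """Hypothetical stand-in for a diffusion-based purifier: circular 3-tap mean.

    The operator is linear and symmetric, so its transpose equals itself.
    """
    n = len(x)
    return [(x[i - 1] + x[i] + x[(i + 1) % n]) / 3.0 for i in range(n)]


def loss(x, w, y):
    """Margin loss of a toy linear classifier; larger means 'more misclassified'."""
    return -y * sum(wi * xi for wi, xi in zip(w, x))


def grad_through_purifier(x, w, y):
    """Gradient of loss(purify(x)) with respect to x.

    Chain rule: P^T * (-y * w). Because the stand-in purifier P is symmetric,
    P^T * v is just purify(v). For this linear toy model the result does not
    depend on x; a real model would need automatic differentiation here.
    """
    return purify([-y * wi for wi in w])


def sign(v):
    return (v > 0) - (v < 0)


def pgd_with_purifier(x0, w, y, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD that ascends the loss of the purified sample."""
    x = list(x0)
    for _ in range(steps):
        g = grad_through_purifier(x, w, y)
        # Signed gradient ascent step.
        x = [xi + alpha * sign(gi) for xi, gi in zip(x, g)]
        # Project back into the eps-ball around the clean sample x0.
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
        # Keep values in the valid pixel range [0, 1].
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x


x_clean = [0.5, 0.4, 0.6, 0.5, 0.3, 0.7]
weights = [1.0, -2.0, 0.5, 0.0, 1.5, -1.0]
x_adv = pgd_with_purifier(x_clean, weights, y=1)
```

In this sketch the perturbation is bounded by `eps`, and the loss is always measured after purification, so the attack is optimized against the purified sample rather than the raw one. Swapping `purify` for an actual diffusion-based denoiser would require differentiating through it with an autodiff framework.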