Adversarial attacks, particularly \textbf{targeted} transfer-based attacks, can be used to assess the adversarial robustness of large vision-language models (VLMs), allowing for a more thorough examination of potential security flaws before deployment. However, previous transfer-based adversarial attacks incur high costs due to high iteration counts and complex method structures. Furthermore, because the adversarial semantics they embed are unnatural, the generated adversarial examples transfer poorly. These issues limit the utility of existing methods for assessing robustness. To address them, we propose AdvDiffVLM, which uses diffusion models to generate natural, unrestricted, targeted adversarial examples via score matching. Specifically, AdvDiffVLM uses Adaptive Ensemble Gradient Estimation to modify the score during the diffusion model's reverse generation process, ensuring that the produced adversarial examples carry natural targeted adversarial semantics and thus transfer better. To further improve the quality of the adversarial examples, we use a GradCAM-guided Mask method to disperse the adversarial semantics throughout the image rather than concentrating them in a single region. Finally, AdvDiffVLM embeds more target semantics into the adversarial examples over multiple iterations. Experimental results show that our method generates adversarial examples 5x to 10x faster than state-of-the-art transfer-based adversarial attacks while producing adversarial examples of higher quality. Moreover, compared to previous transfer-based adversarial attacks, the adversarial examples generated by our method exhibit better transferability. Notably, AdvDiffVLM can successfully attack a variety of commercial VLMs in a black-box setting, including GPT-4V.
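The core idea described above, shifting the diffusion score with an ensemble-estimated adversarial gradient during reverse sampling, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual algorithm: `score_fn`, the surrogate `grad_fns`, and the simple fixed-weight average stand in for the real denoiser and for Adaptive Ensemble Gradient Estimation, and the DDPM-style update is simplified to a single constant `beta`.

```python
import numpy as np

def ensemble_grad(x_t, t, grad_fns, weights):
    """Weighted average of target-semantic gradients from several surrogate
    models (illustrative stand-in for Adaptive Ensemble Gradient Estimation)."""
    grads = np.stack([g(x_t, t) for g in grad_fns])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize ensemble weights
    return np.tensordot(w, grads, axes=1)

def guided_reverse_step(x_t, t, score_fn, grad_fns, weights, scale, beta, rng):
    """One DDPM-style reverse step whose score is shifted by an adversarial
    guidance term, so sampling drifts toward the target semantics."""
    score = score_fn(x_t, t) + scale * ensemble_grad(x_t, t, grad_fns, weights)
    mean = (x_t + beta * score) / np.sqrt(1.0 - beta)
    if t == 0:                            # final step is deterministic
        return mean
    return mean + np.sqrt(beta) * rng.standard_normal(x_t.shape)
```

In the paper's setting, the guidance gradient would come from a target-caption loss under surrogate VLM encoders, and a GradCAM-derived mask would additionally modulate where the perturbation is applied.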