Vision-language pre-training (VLP) models have demonstrated significant success across various domains, yet they remain vulnerable to adversarial attacks. Addressing these adversarial vulnerabilities is crucial for enhancing security in multimodal learning. Traditionally, adversarial methods targeting VLP models perturb images and text simultaneously. However, this approach faces two notable challenges: first, adversarial perturbations often fail to transfer effectively to real-world scenarios; second, direct modifications to the text are conspicuously visible. To overcome these limitations, we propose a novel strategy that attacks exclusively through image patches, preserving the integrity of the original text. Our method leverages prior knowledge from diffusion models to enhance the authenticity and naturalness of the perturbations. Moreover, to optimize patch placement and improve attack efficacy, we exploit the cross-attention mechanism, which captures cross-modal interactions and yields attention maps that guide strategic patch placement. Comprehensive experiments conducted in a white-box setting on image-to-text tasks show that our proposed method significantly outperforms existing techniques, achieving a 100% attack success rate. It also transfers well to text-to-image tasks.
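To make the placement step concrete, the following is a minimal sketch, not the authors' implementation, of how a cross-attention map could guide where an adversarial patch is pasted. The function name, tensor shapes, and the convolution-based window scoring are illustrative assumptions; the sketch assumes the model's cross-attention map has already been extracted and resized to the image resolution.

```python
import torch
import torch.nn.functional as F

def place_patch_by_cross_attention(attn_map, image, patch, stride=1):
    """Hypothetical sketch: paste `patch` on the image region with the
    highest aggregate cross-attention mass.

    attn_map: (H, W) cross-attention map, resized to the image resolution
    image:    (3, H, W) input image tensor
    patch:    (3, ph, pw) adversarial patch tensor
    """
    _, ph, pw = patch.shape
    # Score every candidate window by summing attention inside it,
    # implemented as a 2D convolution with an all-ones kernel.
    kernel = torch.ones(1, 1, ph, pw)
    scores = F.conv2d(attn_map[None, None], kernel, stride=stride)[0, 0]
    # The argmax window is the region most relevant to the paired text.
    idx = torch.argmax(scores)
    top = int(idx // scores.shape[1]) * stride
    left = int(idx % scores.shape[1]) * stride
    # Paste the patch at the selected location.
    patched = image.clone()
    patched[:, top:top + ph, left:left + pw] = patch
    return patched, (top, left)
```

In this sketch the attention map acts purely as a localization prior; the patch contents themselves would be optimized separately (e.g., under a diffusion prior, as the abstract describes) before being placed at the selected window.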