Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing against the vision encoder alone rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study of encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we show that existing attacks exhibit severely limited transferability. Second, we perform an in-depth analysis and identify two root causes that hinder transferability: (1) inconsistent visual grounding across models, where different models attend to distinct image regions; and (2) redundant semantic alignment within models, where a single object's semantics are dispersed across multiple overlapping token representations. Third, we propose the Semantic-Guided Multimodal Attack (SGMA), a novel framework that enhances transferability. Guided by these findings, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both the global and local levels. Extensive experiments across diverse victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
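To make the attacked setting concrete, below is a minimal sketch of the generic encoder-based attack family the abstract critiques (not SGMA itself, whose details are not specified here): an untargeted L_inf PGD that pushes a frozen vision encoder's pooled embedding away from the clean image's embedding. The surrogate checkpoint ("openai/clip-vit-base-patch32"), the MSE objective, and all hyperparameters are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of an encoder-based attack: perturb the image so the
# frozen vision encoder's embedding drifts away from the clean embedding,
# without touching the rest of the LVLM. Surrogate model and hyperparameters
# are illustrative assumptions; this is NOT the paper's SGMA method.
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
model.eval().requires_grad_(False)  # surrogate encoder stays frozen

def encoder_pgd(pixel_values, eps=8 / 255, alpha=1 / 255, steps=40):
    """Untargeted L_inf PGD on the encoder's pooled embedding.

    pixel_values: preprocessed image tensor of shape (B, 3, 224, 224).
    Returns the adversarial pixels (valid-range clamping omitted for
    brevity; real code should also clamp to the preprocessed input range).
    """
    with torch.no_grad():
        clean_emb = model(pixel_values=pixel_values).pooler_output
    delta = torch.zeros_like(pixel_values).uniform_(-eps, eps)
    delta.requires_grad_(True)
    for _ in range(steps):
        adv_emb = model(pixel_values=pixel_values + delta).pooler_output
        # Push the adversarial embedding away from the clean one; whether
        # this transfers to LVLMs built on other encoders is exactly the
        # question the benchmark above answers (largely in the negative).
        loss = F.mse_loss(adv_emb, clean_emb)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient-ascent step
            delta.clamp_(-eps, eps)             # stay inside the eps-ball
            delta.grad = None
    return (pixel_values + delta).detach()
```

Because the loss depends only on the surrogate encoder, each step is far cheaper than backpropagating through a full LVLM; the abstract's two root causes (inconsistent grounding across models, redundant alignment within a model) explain why such surrogate-only perturbations often fail to transfer.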