Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm combined with a simple repetition strategy that jointly bypasses both input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized for one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods. On state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34%. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack
