Recent advances in Large Vision-Language Models (LVLMs) have demonstrated groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full access to model internals, incur high computational costs, and exhibit limited adversarial transferability, making them impractical in real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation through input-output queries alone, requiring no knowledge of model internals; (ii) model-agnostic optimization without a surrogate model; and (iii) lower resource requirements, with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs (InstructBLIP, LLaVA, and MiniGPT-4), achieving a jailbreak success rate of up to 83.0% on InstructBLIP while keeping perturbations as imperceptible as those of white-box methods. Moreover, adversarial examples generated on MiniGPT-4 transfer strongly to other LVLMs, with an attack success rate (ASR) of up to 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
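To make the zeroth-order mechanism concrete, the following is a minimal sketch of the two-query SPSA gradient estimator that underlies this style of attack. It assumes a hypothetical `loss_fn` that queries the target LVLM and returns a scalar jailbreak loss; the function names, step sizes, and the L-infinity projection are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def spsa_gradient(loss_fn, x, c=0.01):
    """Two-query SPSA estimate of the gradient of a black-box loss.

    loss_fn: hypothetical callable wrapping the target LVLM; maps a
             perturbed image array to a scalar jailbreak loss.
    x:       current adversarial image (values in [0, 1]).
    c:       finite-difference step size.
    """
    # Rademacher (+/-1) direction perturbing every pixel simultaneously
    delta = np.random.choice([-1.0, 1.0], size=x.shape)
    # Two forward queries only -- no gradients or model internals needed
    loss_plus = loss_fn(x + c * delta)
    loss_minus = loss_fn(x - c * delta)
    # Since each delta_i is +/-1, 1/delta_i == delta_i, so one query
    # pair yields an estimate of the full gradient vector
    return (loss_plus - loss_minus) / (2.0 * c) * delta

def zo_attack_step(loss_fn, x, x_orig, lr=0.5, eps=8 / 255, c=0.01):
    """One projected descent step, keeping the perturbation inside an
    L-infinity ball of radius eps around the original image."""
    g = spsa_gradient(loss_fn, x, c)
    x = x - lr * g
    return np.clip(x, x_orig - eps, x_orig + eps).clip(0.0, 1.0)
```

Because each update needs only two forward queries regardless of input dimension, and no backward pass through the model, the estimator matches the abstract's claims of gradient-free, model-agnostic optimization with low GPU memory use.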