利用黑盒优化为大型视觉语言模型构建对抗性输入 (Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization)

Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attacks methods require full model accessibility, suffer from computing costs and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without the surrogate model and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs

翻译：近年来，大型视觉语言模型（LVLMs）在多模态任务中展现出突破性能力。然而，这些模型仍易受对抗性越狱攻击的影响，攻击者通过构建细微扰动来绕过安全机制并触发有害输出。现有的白盒攻击方法需要完整的模型访问权限，存在计算成本高、对抗迁移性不足等问题，难以应用于现实世界的黑盒场景。为解决这些局限性，我们提出一种基于同时扰动随机逼近零阶优化（ZO-SPSA）的黑盒越狱攻击方法。ZO-SPSA具有三大优势：（i）通过输入输出交互实现无需模型知识的无梯度逼近；（ii）无需代理模型的模型无关优化；（iii）降低GPU内存消耗的资源需求。我们在InstructBLIP、LLaVA和MiniGPT-4三个LVLM上评估ZO-SPSA，在InstructBLIP上实现了83.0%的最高越狱成功率，同时保持与白盒方法相当的不可感知扰动。此外，从MiniGPT-4生成的对抗样本对其他LVLM表现出强迁移性，攻击成功率（ASR）达64.18%。这些发现揭示了黑盒越狱在现实场景中的可行性，并暴露了当前LVLMs安全机制的关键缺陷。

相关内容

黑盒

关注 1

在科学，计算和工程学中，黑盒是一种设备，系统或对象，可以根据其输入和输出（或传输特性）对其进行查看，而无需对其内部工作有任何了解。它的实现是“不透明的”（黑色）。几乎任何事物都可以被称为黑盒：晶体管，引擎，算法，人脑，机构或政府。为了使用典型的“黑匣子方法”来分析建模为开放系统的事物，仅考虑刺激/响应的行为，以推断（未知）盒子。该黑匣子系统的通常表示形式是在该方框中居中的数据流程图。黑盒的对立面是一个内部组件或逻辑可用于检查的系统，通常将其称为白盒（有时也称为“透明盒”或“玻璃盒”）。

【AAAI2026】TOFA：面向视觉-语言模型的免训练一次性联邦自适应方法

专知会员服务

13+阅读 · 2025年11月23日

面向具身操作的高效视觉–语言–动作模型：系统综述

专知会员服务

22+阅读 · 2025年10月22日

【ICLR2025】为多模态图像-文本表示可解释性缩小信息瓶颈理论

专知会员服务

15+阅读 · 2025年2月24日

【CVPR 2022】基于实例深度估计的统一深度感知全景分割 PanopticDepth: Per-Instance Depth Estimation for Unified Depth-Aware Panoptic Segmentation

专知会员服务

18+阅读 · 2022年3月19日