LVLMs are widely used but vulnerable to jailbreak attacks that elicit illegal or unethical responses. To ensure their responsible deployment in real-world applications, it is essential to understand these vulnerabilities. Current work suffers from four main issues: the limitation to single-round attacks, insufficient dual-modal synergy, poor transferability to black-box models, and reliance on prompt engineering. To address these limitations, we propose BAMBA, a bimodal adversarial multi-round black-box jailbreak attacker for LVLMs. We first use an image optimizer to learn malicious features from a harmful corpus, then deepen these features via text-image interaction in a bimodal optimizer, generating an adversarial text-image pair for the jailbreak. Experiments on various LVLMs and datasets demonstrate that BAMBA outperforms other baselines.
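The two-stage pipeline above (an image optimizer followed by a bimodal optimizer that alternates modalities across rounds) can be sketched as a toy black-box search. This is a minimal illustration, not the paper's actual method: the scoring oracle `stub_lvlm_score`, the random-search `image_optimizer`, the greedy suffix selection, and all names and parameters are hypothetical stand-ins for querying a real LVLM.

```python
import random

def stub_lvlm_score(image, text):
    # Hypothetical black-box oracle: stands in for querying a real LVLM
    # and measuring how jailbreak-like its response is (higher = better for
    # the attacker). Here it is a trivial deterministic function.
    return sum(image) + 0.1 * len(text)

def image_optimizer(image, text, score_fn, steps=10, rng=None):
    # Stage 1 (assumed form): random-search perturbation of image features,
    # keeping only changes that raise the black-box score.
    rng = rng or random.Random(0)
    best = list(image)
    best_score = score_fn(best, text)
    for _ in range(steps):
        cand = [x + rng.uniform(-0.1, 0.1) for x in best]
        s = score_fn(cand, text)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

def bimodal_round(image, text, score_fn, suffixes, rng=None):
    # Stage 2 (assumed form): one round of text-image interaction that
    # first refines the image, then greedily appends a text suffix,
    # so each modality's update conditions on the other.
    image, _ = image_optimizer(image, text, score_fn, rng=rng)
    best_text, best_score = text, score_fn(image, text)
    for suf in suffixes:
        cand = text + " " + suf
        s = score_fn(image, cand)
        if s > best_score:
            best_text, best_score = cand, s
    return image, best_text, best_score
```

A multi-round attack would simply call `bimodal_round` repeatedly, feeding each round's image-text pair into the next; since candidates are only accepted on improvement, the score is nondecreasing across rounds.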