In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that image input is an alignment vulnerability of MLLMs. Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input using meticulously crafted images. Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data will be publicly released.