In this paper, we study the harmlessness alignment problem of multimodal large language models~(MLLMs). We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs and reveal that image inputs expose an alignment vulnerability in MLLMs. Inspired by this finding, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent in the text input using meticulously crafted images. Experimental results show that HADES effectively jailbreaks existing MLLMs, achieving an average Attack Success Rate~(ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision. Our code and data will be publicly released.