With the rapid advancement of multimodal large language models (MLLMs), concerns about their security have increasingly drawn attention from both academia and industry. Although MLLMs are vulnerable to jailbreak attacks, designing effective multimodal jailbreaks poses unique challenges, especially given the distinct protective measures deployed across modalities in commercial models. Previous works concentrate the risk in a single modality, resulting in limited jailbreak performance. In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack, called HIMRD, which consists of two components: a multimodal risk distribution strategy and a heuristic-induced search strategy. The multimodal risk distribution strategy segments harmful instructions across multiple modalities to effectively circumvent the security protections of MLLMs. The heuristic-induced search strategy searches for two types of prompts: an understanding-enhancing prompt, which helps the MLLM reconstruct the malicious instruction, and an inducing prompt, which increases the likelihood of affirmative responses over refusals, enabling a successful jailbreak. Extensive experiments demonstrate that this approach effectively uncovers vulnerabilities in MLLMs, achieving an average attack success rate of 90% across seven popular open-source MLLMs and around 68% across three popular closed-source MLLMs. Our code will be released soon. Warning: this paper contains offensive and harmful examples; reader discretion is advised.