Multimodal Large Language Models (MLLMs) have achieved remarkable performance but remain vulnerable to jailbreak attacks that induce harmful content and undermine their secure deployment. Previous studies have shown that introducing additional inference steps, which disrupt security attention, makes MLLMs more susceptible to being misled into generating malicious content. However, these methods rely on single-image masking or isolated visual cues, which extend reasoning paths only modestly and therefore have limited effectiveness, particularly against strongly aligned commercial closed-source models. To address this problem, we propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits, disperses them across multiple visual cues, and leverages cross-image reasoning to gradually reconstruct the malicious intent, thereby bypassing existing safety mechanisms. MIDAS enforces longer and more structured multi-image chained reasoning, which substantially increases the model's reliance on visual cues, delays the exposure of malicious semantics, and significantly reduces the model's security attention, thereby improving jailbreak performance against advanced MLLMs. Extensive experiments across different datasets and MLLMs demonstrate that MIDAS outperforms state-of-the-art jailbreak attacks for MLLMs and achieves an average attack success rate of 81.46% across 4 closed-source MLLMs. Our code is available at this [link](https://github.com/Winnie-Lian/MIDAS).