Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but they also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and "complete the comic." Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. We then show that existing defense methods, although effective against the harmful comics, induce a high refusal rate on benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment that is robust to narrative-driven multimodal jailbreaks.