Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but they also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and "complete the comic." Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. We then evaluate existing defense methods and show that, although they are effective against harmful comics, they induce a high refusal rate on benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment that is robust to narrative-driven multimodal jailbreaks.