Multi-turn jailbreak attacks are effective against text-only large language models (LLMs), gradually introducing malicious content across conversation turns. When such attacks are extended to large vision-language models (LVLMs), we find that naively adding visual inputs makes existing multi-turn jailbreaks easy to defend against: overly malicious visual input readily triggers the defense mechanisms of safety-aligned LVLMs, producing more conservative responses. To address this, we propose MAPA, a multi-turn adaptive prompting attack that 1) within each turn, alternates between text and vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 11-35% on recent benchmarks against LLaVA-V1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct, and GPT-4o-mini.