PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

From arXiv: This version has been withdrawn to consolidate the submission under the corresponding author's primary account. The most recent and maintained version of this work can be found at arXiv:2603.09246.

The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation. However, these defenses remain vulnerable to adversarial attacks. Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps. In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. A carefully engineered textual prompt then directs the model to integrate these gadgets through its reasoning process into a coherent and harmful output. As a result, the malicious intent is emergent and difficult to detect from any single component. We validate our method through extensive experiments on established benchmarks, including SafeBench and MM-SafetyBench, targeting popular LVLMs. Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (ASR; over 0.90 on SafeBench) and improving ASR by up to 0.39. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.
