Large Vision-Language Models (LVLMs) undergo safety alignment to suppress the generation of harmful content. However, current defenses predominantly target explicit malicious patterns in the input representation, overlooking vulnerabilities inherent in compositional reasoning. In this paper, we identify a systemic flaw whereby LVLMs can be induced to synthesize harmful logic from benign premises. We formalize this attack paradigm as \textit{Reasoning-Oriented Programming}, drawing a structural analogy to Return-Oriented Programming (ROP) in systems security. Just as ROP circumvents memory protections by chaining benign instruction sequences, our approach exploits the model's instruction-following capability to orchestrate a semantic collision among orthogonal benign inputs. We instantiate this paradigm via \tool{}, an automated framework that optimizes for \textit{semantic orthogonality} and \textit{spatial isolation}. By generating visual gadgets that are semantically decoupled from the harmful intent and spatially arranging them to prevent premature feature fusion, \tool{} forces the malicious logic to emerge only during the late-stage reasoning process, effectively bypassing perception-level alignment. We evaluate \tool{} on SafeBench and MM-SafetyBench across 7 state-of-the-art LVLMs, including GPT-4o and Claude 3.7 Sonnet. Our results demonstrate that \tool{} consistently circumvents safety alignment, outperforming the strongest existing baseline by an average of 4.67\% on open-source models and 9.50\% on commercial models.