Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend against. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose SPARK, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger's effect. We formalize attack generation as a constrained optimization over this modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments across 7 T2V models demonstrate the efficacy of our attack, achieving a +23% improvement in average attack success rate on commercial models.
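The modular prompt space and guided search described above can be sketched in miniature. Everything below is a hypothetical illustration with benign placeholder content: the component pools, the `stealth` and `effectiveness` scoring stubs, and the weighting `lam` are all assumptions, not the paper's actual implementation, and the exhaustive loop stands in for the guided search procedure over a small toy space.

```python
import itertools

# Hypothetical component pools for the three prompt modules (placeholders,
# not taken from the paper): scene anchors, auditory triggers, style modulators.
SCENE_ANCHORS = ["an old farmhouse kitchen at dusk", "a quiet suburban garage"]
AUDIO_TRIGGERS = ["a faint creaking from the hallway", "muffled thuds behind the wall"]
STYLE_MODULATORS = ["handheld close-up, dim lighting", "slow dolly-in, cold color grade"]


def stealth(prompt: str) -> float:
    """Stub: a real attack would query a safety filter; here we just
    penalize a toy list of flagged words."""
    flagged = ("weapon", "blood")
    return 1.0 / (1.0 + sum(prompt.count(w) for w in flagged))


def effectiveness(prompt: str) -> float:
    """Stub: a real attack would score the generated video's semantics
    against the blocked intent; here length is a meaningless proxy."""
    return len(prompt) / 200.0


def assemble(anchor: str, trigger: str, style: str) -> str:
    """Compose the three modules into one candidate prompt."""
    return f"{anchor}; {trigger}; {style}"


def guided_search(lam: float = 0.5) -> str:
    """Pick the candidate maximizing a weighted stealth/effectiveness
    trade-off. Exhaustive here; the paper uses a guided procedure."""
    candidates = (
        assemble(a, t, s)
        for a, t, s in itertools.product(SCENE_ANCHORS, AUDIO_TRIGGERS, STYLE_MODULATORS)
    )
    return max(candidates, key=lambda p: lam * stealth(p) + (1 - lam) * effectiveness(p))


if __name__ == "__main__":
    print(guided_search())
```

The key structural point is that each module is searched independently but scored jointly, so the optimizer can trade a more conspicuous trigger against a more neutral anchor while keeping the composite prompt plausible.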