Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality alignment. However, whether such alignment inadvertently facilitates the transfer of safety vulnerabilities across modalities remains underexplored. This question is critical as text-based jailbreak attacks are considerably more mature than audio-based ones; if they transfer systematically, current audio safety evaluations may underestimate risks originating from the text modality. In this paper, we introduce the Alignment Curse, a formally characterized and empirically validated principle showing that stronger modality alignment enables more effective transfer of attacks from text to audio, revealing a fundamental tension between capability and safety. Motivated by this principle, we conduct a comprehensive black-box evaluation of three attack categories on recent omni-models (e.g., Qwen2.5-Omni, Qwen3-Omni): text attacks, text-transferred audio attacks, and audio attacks. We find that text-transferred audio attacks perform comparably to, and often better than, audio-based attacks, exhibiting a clear advantage under audio-only access. This suggests that text-based vulnerabilities play a pivotal role in shaping audio safety risks. Finally, we empirically analyze the relationship between modality alignment and transfer effectiveness across attack methods and models, observing consistent support for the Alignment Curse: tighter modality alignment leads to more effective cross-modality attack transfer.