Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.