AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

Jailbreak attacks against large audio-language models (LALMs) have been studied recently, but prior work focuses exclusively on the scenario where the adversary can fully manipulate user prompts (the strong adversary) and is limited in effectiveness, applicability, and practicability. In this work, we first conduct an extensive evaluation showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to-speech (TTS) techniques. We then propose AUDIOJAILBREAK, a novel audio jailbreak attack featuring (1) asynchrony: by crafting suffixal jailbreak audios, the jailbreak audio need not align with the user prompt on the time axis; (2) universality: by incorporating multiple prompts into perturbation generation, a single jailbreak perturbation is effective for different prompts; (3) stealthiness: various intent concealment strategies conceal the malicious intent of the jailbreak audio; and (4) over-the-air robustness: by incorporating reverberation into perturbation generation, the jailbreak audio remains effective when played over the air. In contrast, every prior audio jailbreak attack lacks one or more of these four properties. Moreover, AUDIOJAILBREAK also applies to a more practical and broader attack scenario in which the adversary cannot fully manipulate user prompts (the weak adversary). Extensive experiments on the largest set of LALMs evaluated to date demonstrate the high effectiveness of AUDIOJAILBREAK; in particular, it can jailbreak OpenAI's GPT-4o-Audio and bypass Meta's Llama-Guard-3 safeguard in the weak adversary scenario. Our work sheds light on the security implications of audio jailbreak attacks against LALMs and should encourage realistic efforts to improve their robustness, especially against the newly proposed weak adversary.
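
To make the signal-level meaning of asynchrony (1) and over-the-air robustness (4) concrete, the following is a minimal sketch and not the authors' implementation. It only shows how a suffix audio can be appended after a user prompt and how playback in a room can be simulated by convolving with an impulse response. The function names, the 16 kHz sampling rate, and the decaying-noise impulse response are all assumptions made for illustration; the perturbation optimization itself is omitted.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz (not stated in the abstract)


def synthetic_rir(rt60=0.4, sample_rate=SAMPLE_RATE, seed=0):
    """Toy room impulse response: exponentially decaying noise.

    This is an illustrative stand-in, not a measured or simulated room.
    """
    rng = np.random.default_rng(seed)
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    # Amplitude falls by 60 dB over rt60 seconds: 10^(-3 * t / rt60).
    envelope = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))


def append_suffix(prompt, suffix):
    """Place the suffix audio after the prompt; no time alignment is needed."""
    return np.concatenate([prompt, suffix])


def play_over_the_air(audio, rir):
    """Simulate reverberant playback by convolving with the impulse response."""
    reverbed = np.convolve(audio, rir)[: len(audio)]
    peak = np.max(np.abs(reverbed))
    # Rescale only if the reverberant signal would clip.
    return reverbed / peak if peak > 1.0 else reverbed


# Usage: one hypothetical suffix reused across two different prompts.
rng = np.random.default_rng(1)
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
prompts = [0.1 * np.sin(2 * np.pi * 220 * t), 0.1 * np.sin(2 * np.pi * 440 * t)]
suffix = 0.01 * rng.standard_normal(SAMPLE_RATE // 2)
rir = synthetic_rir()
received = [play_over_the_air(append_suffix(p, suffix), rir) for p in prompts]
```

Reusing one `suffix` across several prompts mirrors the universality property in spirit only: in the paper the perturbation is generated with multiple prompts taken into account, whereas here the suffix is plain noise.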
