Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNN-based audio models. Specifically, the audio encoders in LALMs involve discretization operations that often cause gradient shattering, hindering attacks that rely on gradient-based optimization. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover, enforcing stealthiness constraints on adversarial audio waveforms shrinks the feasible solution space and renders it non-convex, further intensifying the challenges of the optimization process. To overcome these challenges, we develop AdvWave, the first jailbreak framework against LALMs. We propose a dual-phase optimization method that addresses gradient shattering, enabling effective end-to-end gradient-based optimization. Additionally, we develop an adaptive adversarial target search algorithm that dynamically adjusts the adversarial optimization target based on the response patterns of LALMs to specific queries. To ensure that adversarial audio remains perceptually natural to human listeners, we design a classifier-guided optimization approach that generates adversarial noise resembling common urban sounds. Extensive evaluations on multiple advanced LALMs demonstrate that AdvWave outperforms baseline methods, achieving a 40% higher average jailbreak attack success rate.
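The gradient-shattering problem described above can be illustrated with a minimal numerical sketch: a hard discretization step (here plain rounding, standing in for an audio tokenizer's codebook lookup) has zero gradient almost everywhere, so gradient-based attacks receive no optimization signal, whereas a straight-through-style surrogate restores a usable one. The functions below are hypothetical illustrations of this general phenomenon, not the paper's actual AdvWave implementation.

```python
def quantize(x):
    # Hard discretization, standing in for the codebook lookup / rounding
    # inside an LALM audio encoder (hypothetical stand-in).
    return float(round(x))

def grad_true(x, eps=1e-4):
    # Finite-difference gradient of the quantizer: zero almost everywhere,
    # so end-to-end gradients through it "shatter".
    return (quantize(x + eps) - quantize(x - eps)) / (2 * eps)

def grad_ste(x):
    # Straight-through estimator: treat the quantizer as the identity in
    # the backward pass, yielding a constant surrogate gradient of 1.
    return 1.0

for x in (0.3, 1.7, -2.2):
    print(f"x={x}: true grad = {grad_true(x)}, STE surrogate = {grad_ste(x)}")
```

For waveform inputs that do not sit exactly on a quantization boundary, the true gradient is zero while the surrogate passes the upstream gradient through unchanged, which is the kind of workaround that makes end-to-end gradient-based optimization feasible despite the discretization.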