Large Audio Language Models (LALMs) have made significant progress and are increasingly deployed in real-world applications, yet they face growing safety risks from jailbreak attacks that bypass safety alignment. However, there is still no adversarial audio dataset or unified framework specifically designed to evaluate and compare jailbreak attacks against LALMs. To address this gap, we introduce JALMBench, a comprehensive benchmark for assessing LALM safety against jailbreak attacks, comprising 11,316 text samples and 245,355 audio samples (over 1,000 hours). JALMBench supports 12 mainstream LALMs, 8 attack methods (4 text-transferred and 4 audio-originated), and 5 defense methods. We conduct an in-depth analysis of attack efficiency, topic sensitivity, voice diversity, and model architecture. Additionally, we explore mitigation strategies for the attacks at both the prompt and response levels. Our systematic evaluation reveals that LALM safety is strongly influenced by modality and architectural choices: text-based safety alignment can partially transfer to audio inputs, and interleaved audio-text strategies enable more robust cross-modal generalization. Existing general-purpose moderation methods improve safety only slightly, highlighting the need for defenses specifically designed for LALMs. We hope our work sheds light on design principles for building more robust LALMs.