Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still evaluated primarily on monolingual, text-based harmful prompts. How well that alignment generalizes to multilingual and spoken settings, particularly code-switched speech, remains largely unexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking multiple state-of-the-art LALMs. We further probe the extent of their safety weaknesses with an augmented setting in which phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields notably high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, demonstrating that natural-sounding obfuscation can effectively bypass safety policies.