Recent progress in LLMs has enabled understanding of audio signals, but has also exposed new safety risks arising from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition to enable effective black-box attacks. SACRED-Bench adopts three composition mechanisms: (a) overlap of harmful and benign speech, (b) mixture of benign speech with harmful non-speech audio, and (c) multi-speaker dialogue. These mechanisms focus on evaluating safety in settings where benign and harmful intents co-occur within a single auditory scene. Moreover, questions in SACRED-Bench are designed to implicitly refer to content in the audio, such that no explicit harmful information appears in the text prompt alone. Experiments demonstrate that even Gemini 2.5 Pro, a state-of-the-art proprietary LLM with safety guardrails fully enabled, still exhibits a 66% attack success rate. To bridge this gap, we propose SALMONN-Guard, the first guard model that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses to ensure the safety of multimodal LLMs. The dataset and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench.
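The attack success rates quoted above (66% without the guard, 20% with SALMONN-Guard) can be read as a simple proportion. The sketch below is a minimal illustration, not the authors' evaluation code. It assumes each attack attempt has already been labeled as successful or not by some judge, which the abstract does not specify. The function names and the toy labels are hypothetical.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Fraction of attack attempts judged successful.

    Assumption: True means the model produced a harmful response.
    How that judgment is made is not described in the abstract.
    """
    if not judgments:
        raise ValueError("need at least one judged attempt")
    return sum(judgments) / len(judgments)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in attack success rate, as a fraction of the original rate."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the benchmark.
    toy_judgments = [True, True, False, True, False]
    print(f"toy ASR: {attack_success_rate(toy_judgments):.2f}")

    # Figures reported in the abstract: 66% before, 20% with the guard model.
    print(f"relative reduction: {relative_reduction(0.66, 0.20):.1%}")
```

With the reported figures, the drop from 66% to 20% is 46 percentage points, or roughly a 70% relative reduction in successful attacks.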