Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also opens new and unique attack vectors. Prior work has proposed jailbreak attacks that specifically target ALMs, showing that defenses borrowed from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this gap, we propose ALMGuard, the first defense framework tailored to ALMs. Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, we design a method to identify universal Shortcut Activation Perturbations (SAPs): triggers that activate these safety shortcuts to safeguard ALMs at inference time. To sift out effective triggers while preserving the model's utility on benign tasks, we further propose the Mel-Gradient Sparse Mask (M-GSM), which restricts perturbations to Mel-frequency bins that are sensitive to jailbreaks but insensitive to speech understanding. Both theoretical analysis and empirical results demonstrate the robustness of our method against both seen and unseen attacks. Overall, ALMGuard reduces the average success rate of advanced ALM-specific jailbreak attacks to 4.6% across four models while maintaining comparable utility on benign benchmarks, establishing it as the new state of the art. Our code and data are available at https://github.com/WeifeiJin/ALMGuard.
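To make the M-GSM and SAP ideas concrete, the sketch below shows one plausible reading of the pipeline in PyTorch. Everything here is an illustrative assumption rather than the paper's exact formulation: the stand-in losses (safety_loss, utility_loss), the top-k keep ratio, and the perturbation budget are all hypothetical. The mask keeps Mel bins whose gradient energy is high for a jailbreak-related objective but low for a benign-utility objective, and a universal perturbation is then optimized only within those bins.

```python
import torch

def mel_gradient_sparse_mask(mel, safety_loss_fn, utility_loss_fn, keep_ratio=0.2):
    """Sketch of M-GSM: keep Mel bins that are sensitive to the jailbreak
    objective but insensitive to the benign-utility objective."""
    x = mel.detach().clone().requires_grad_(True)
    g_safe = torch.autograd.grad(safety_loss_fn(x), x)[0]   # sensitivity to jailbreaks
    x = mel.detach().clone().requires_grad_(True)
    g_util = torch.autograd.grad(utility_loss_fn(x), x)[0]  # sensitivity to speech understanding
    # Aggregate gradient energy over time frames -> one score per Mel bin.
    score = g_safe.abs().mean(dim=-1) / (g_util.abs().mean(dim=-1) + 1e-8)
    k = max(1, int(keep_ratio * score.numel()))
    mask = torch.zeros_like(score)
    mask[score.topk(k).indices] = 1.0
    return mask.unsqueeze(-1)  # broadcasts over time frames

# Toy stand-in losses (assumptions, purely for a runnable demo): pretend the
# lower Mel bins drive refusals and the upper bins carry speech content.
n_mels, frames = 80, 200
mel = torch.randn(n_mels, frames)
safety_loss = lambda x: (x[:40] ** 2).sum()
utility_loss = lambda x: (x[40:] ** 2).sum()
mask = mel_gradient_sparse_mask(mel, safety_loss, utility_loss)

# Optimize a universal SAP restricted to the masked bins (sketch): maximize
# the safety objective under a small L-inf budget for imperceptibility.
delta = torch.zeros(n_mels, 1, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    (-safety_loss(mel + mask * delta)).backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-0.5, 0.5)  # perturbation budget (illustrative value)
```

At inference time, under this reading, the frozen SAP would simply be added to each incoming Mel representation (or the equivalent waveform-domain signal) before the ALM processes the input.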