Large Audio-Language Models (LALMs) are becoming an essential multimodal backbone for real-world applications. However, recent studies show that audio inputs elicit harmful responses more easily than text, posing new risks for deployment. While safety alignment has made initial advances in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), we find that naively adapting these approaches to LALMs faces two key limitations: 1) LLM-based steering fails under audio input because of the large distributional gap between text and audio activations, and 2) prompt-based defenses induce over-refusal on benign speech queries. To address these challenges, we propose Safe-Ablated Refusal Steering (SARSteer), the first inference-time defense framework for LALMs. Specifically, SARSteer uses text-derived refusal steering to enforce rejection without manipulating audio inputs, and introduces decomposed safe-space ablation to mitigate over-refusal. Extensive experiments demonstrate that SARSteer significantly improves refusal of harmful queries while preserving responses to benign ones, establishing a principled step toward safety alignment in LALMs. The code and constructed datasets are released at https://github.com/linweiii/SARSteer.
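The two ingredients named above, a text-derived refusal direction and ablation of a "safe" subspace, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the released repository is the reference for that). It assumes a difference-of-means refusal direction, a safe subspace taken as the top-k principal directions of benign activations, and random arrays standing in for real hidden states. All function names, the dimension, `k`, and `alpha` are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, benign_acts):
    """Assumed recipe: unit-norm difference of mean text activations."""
    d = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def safe_basis(benign_acts, k):
    """Assumed 'safe space': top-k principal directions of benign activations."""
    centered = benign_acts - benign_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]  # shape (k, dim), rows are orthonormal

def ablate(v, basis):
    """Remove from v its projection onto the span of the basis rows."""
    return v - basis.T @ (basis @ v)

def steer(hidden, v, alpha):
    """Add the scaled steering vector to every hidden state."""
    return hidden + alpha * v

# Synthetic stand-ins for text activations (illustration only).
rng = np.random.default_rng(0)
dim = 16
benign = rng.normal(size=(64, dim))
harmful = rng.normal(size=(64, dim)) + 2.0  # shifted cluster

v = refusal_direction(harmful, benign)
basis = safe_basis(benign, k=4)
v_safe = ablate(v, basis)  # steering vector with safe-space components removed

hidden = rng.normal(size=(5, dim))  # stand-in for activations under audio input
steered = steer(hidden, v_safe, alpha=4.0)
```

In this sketch, ablation leaves the steering vector orthogonal to the dominant benign directions, so adding it does not move activations along those directions. Whether this matches the decomposition used in SARSteer is an assumption.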