ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models

Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also brings new and unique vulnerability vectors. Previous studies have proposed jailbreak attacks that specifically target ALMs, revealing that defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose ALMGuard, the first defense framework tailored to ALMs. Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, we design a method to identify universal Shortcut Activation Perturbations (SAPs), which act as triggers that activate these safety shortcuts and safeguard ALMs at inference time. To better sift out effective triggers while preserving the model's utility on benign tasks, we further propose the Mel-Gradient Sparse Mask (M-GSM), which restricts perturbations to Mel-frequency bins that are sensitive to jailbreaks but insensitive to speech understanding. Both theoretical analyses and empirical results demonstrate the robustness of our method against both seen and unseen attacks. Overall, ALMGuard reduces the average success rate of advanced ALM-specific jailbreak attacks to 4.6% across four models, while maintaining comparable utility on benign benchmarks, establishing it as the new state of the art. Our code and data are available at https://github.com/WeifeiJin/ALMGuard.
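The masking idea behind M-GSM can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the ratio-based scoring rule, the `keep_ratio` parameter, and the clipping bound `epsilon` are all assumptions made for illustration. It assumes two precomputed gradient maps over a Mel spectrogram, one from a safety (jailbreak-refusal) objective and one from a speech-understanding objective, and keeps only the Mel bins where the first is large relative to the second.

```python
import numpy as np

def mel_gradient_sparse_mask(grad_safety, grad_utility, keep_ratio=0.25, eps=1e-8):
    """Select Mel bins that matter for safety but little for utility.

    Both inputs have shape (n_mels, n_frames). The ratio score used here
    is a hypothetical choice, not taken from the paper.
    """
    # Per-bin sensitivity: mean absolute gradient over time frames.
    safety_sens = np.abs(grad_safety).mean(axis=1)
    utility_sens = np.abs(grad_utility).mean(axis=1)
    # High score: sensitive to jailbreaks, insensitive to speech understanding.
    score = safety_sens / (utility_sens + eps)
    k = max(1, int(round(keep_ratio * score.size)))
    keep = np.argsort(score)[-k:]  # indices of the k highest-scoring bins
    mask = np.zeros(score.size, dtype=bool)
    mask[keep] = True
    return mask

def apply_masked_perturbation(mel, perturbation, mask, epsilon=0.5):
    """Add a bounded perturbation only on the selected Mel bins."""
    delta = np.clip(perturbation, -epsilon, epsilon) * mask[:, None]
    return mel + delta
```

In this sketch the same perturbation would be added to every input spectrogram at inference time, which matches the "universal" framing in the abstract; how the perturbation itself is optimized is outside what the sketch covers.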

