As large language models (LLMs) become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that manipulates state-of-the-art audio language models into generating harmful content. Our method embeds harmful payloads as subtle, near-imperceptible perturbations in audio inputs that remain intelligible to human listeners. Stage 1 uses a novel reward-based white-box optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to jailbreak the target model and elicit a native harmful response. This native response then serves as the target for Stage 2, Payload Injection, in which gradient-based optimization embeds subtle perturbations into benign audio carriers, such as weather queries or greeting messages. Our method achieves average attack success rates of 60–78% across two benchmarks and five multimodal LLMs, as validated by multiple evaluation frameworks. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating multimodal AI systems.
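To make the attack pipeline concrete, the following is a minimal sketch of the Stage 2 Payload Injection loop: projected gradient descent on a benign audio carrier that pushes the victim model toward the Stage 1 target response. Everything model-specific here is an assumption for illustration: `target_logprob` is a hypothetical stand-in for the victim audio language model's scoring function, and the hyperparameters are placeholders, not the paper's settings.

```python
# Minimal sketch of Stage 2 (Payload Injection). PGD searches for a small
# perturbation delta, bounded in L-infinity norm, such that the perturbed
# carrier drives the model toward a fixed target response. NOTE: the model
# interface below is a hypothetical stand-in, not the paper's implementation.
import torch

def target_logprob(audio: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the victim audio LM: should return the
    log-probability of generating `target_ids` (the Stage 1 native harmful
    response, teacher-forced) given `audio`. Here: a toy differentiable
    function so the sketch runs end to end."""
    return -((audio.mean() - target_ids.float().mean() / 1000.0) ** 2)

def payload_injection(carrier: torch.Tensor,
                      target_ids: torch.Tensor,
                      eps: float = 2e-3,    # L-inf budget keeps the edit subtle
                      alpha: float = 2e-4,  # PGD step size
                      steps: int = 300) -> torch.Tensor:
    """Maximize log p(target | carrier + delta) subject to ||delta||_inf <= eps."""
    delta = torch.zeros_like(carrier, requires_grad=True)
    for _ in range(steps):
        loss = -target_logprob(carrier + delta, target_ids)  # negative log-likelihood
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # signed gradient step
            delta.clamp_(-eps, eps)             # project back into the eps-ball
        delta.grad.zero_()
    # Clamp to a valid waveform range before returning the adversarial audio.
    return (carrier + delta.detach()).clamp(-1.0, 1.0)

# Usage with stand-in data: a 1-second, 16 kHz benign carrier and the token
# ids of the Stage 1 response (both hypothetical).
carrier = 0.1 * torch.randn(16000)
target_ids = torch.randint(0, 32000, (24,))
adversarial_audio = payload_injection(carrier, target_ids)
```

Stage 1 (RL-PGD) would replace the fixed `target_ids` with a reward-guided search over the model's own outputs; once a native harmful response is found, that response becomes the optimization target above.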