Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

Modern large audio-language models (LALMs) power intelligent voice interactions by tightly integrating audio and text. This integration, however, expands the attack surface beyond text and introduces vulnerabilities in the continuous, high-dimensional audio channel. While prior work has studied audio jailbreaks, the security risks of malicious audio injection and downstream behavior manipulation remain underexamined. In this work, we reveal a previously overlooked threat, auditory prompt injection, under the realistic constraints of access to audio data only and strong perceptual stealth. To systematically analyze this threat, we propose \textit{AudioHijack}, a general framework that generates context-agnostic and imperceptible adversarial audio to hijack LALMs. \textit{AudioHijack} employs sampling-based gradient estimation for end-to-end optimization across diverse models, bypassing non-differentiable audio tokenization. Through attention supervision and multi-context training, it steers model attention toward the adversarial audio and generalizes to unseen user contexts. We also design a convolutional blending method that modulates perturbations into natural-sounding reverberation, making them highly imperceptible to users. Extensive experiments on 13 state-of-the-art LALMs show consistent hijacking across 6 misbehavior categories, with average success rates of 79\%--96\% on unseen user contexts and high acoustic fidelity. Real-world studies demonstrate that commercial voice agents from Mistral AI and Microsoft Azure can be induced to execute unauthorized actions on behalf of users. These findings expose critical vulnerabilities in LALMs and highlight the urgent need for dedicated defenses.

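The abstract describes reverberation only at a high level, so the following is a minimal sketch of the underlying signal-processing notion, not the authors' convolutional blending method: reverberation is commonly modeled as the convolution of a dry signal with a room impulse response. The function names, the synthetic impulse response, and all parameter values are hypothetical and chosen purely for illustration; no adversarial optimization is involved.

```python
import numpy as np


def synthetic_impulse_response(sample_rate=16000, duration_s=0.3,
                               decay_s=0.05, seed=0):
    """Hypothetical room impulse response (assumption, not from the paper):
    a unit direct path followed by an exponentially decaying noise tail."""
    rng = np.random.default_rng(seed)
    n = int(sample_rate * duration_s)
    t = np.arange(n) / sample_rate
    # Decaying noise approximates diffuse late reflections.
    ir = 0.1 * rng.standard_normal(n) * np.exp(-t / decay_s)
    # The first tap is the direct (unreflected) sound.
    ir[0] = 1.0
    return ir


def apply_reverb(audio, impulse_response):
    """Convolve a mono signal with an impulse response, keep the original
    length, and rescale only if the result would exceed the [-1, 1] range."""
    wet = np.convolve(audio, impulse_response, mode="full")[: len(audio)]
    peak = np.max(np.abs(wet))
    if peak > 1.0:
        wet = wet / peak
    return wet


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    dry = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone
    wet = apply_reverb(dry, synthetic_impulse_response(sample_rate=sr))
    print(dry.shape, wet.shape)
```

Truncating the convolution to the input length drops the reverberant tail after the signal ends; a real pipeline might instead keep the full output or use a measured impulse response.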