Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving attack success rates of up to 96%. We further show that attacks succeed under low perceptual distortion (LPIPS ≤ 0.08, SI-SNR ≥ 0) and benefit more from extended optimization than from increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper respond primarily to perturbation magnitude, reaching attack success rates above 97% under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.
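To make the threat model concrete, the sketch below shows one way an untargeted, audio-only perturbation of the kind described above could be optimized against a frozen audio encoder: a PGD-style loop that pushes the perturbed waveform's embedding away from its clean embedding under an L∞ budget. This is a minimal illustration under assumed interfaces, not the paper's implementation; `audio_encoder`, the step sizes, and the budget `eps` are hypothetical placeholders, and the paper's six objectives (attention, hidden states, output likelihoods, etc.) would swap in different loss terms.

```python
# Minimal sketch of an untargeted, audio-only attack objective (assumed setup,
# not the paper's code): maximize the distance between the audio encoder's
# features for the clean and perturbed waveforms, under an L_inf budget.
import torch
import torch.nn.functional as F

def untargeted_audio_pgd(audio_encoder, waveform, eps=2e-3, alpha=2e-4, steps=500):
    """Return an adversarial waveform whose encoder features diverge from the clean ones.

    audio_encoder: frozen module mapping a waveform batch to feature tensors (hypothetical).
    waveform:      clean input of shape (batch, samples).
    """
    audio_encoder.eval()
    with torch.no_grad():
        clean_feat = audio_encoder(waveform)              # reference (clean) representation
    delta = torch.zeros_like(waveform, requires_grad=True)
    for _ in range(steps):
        adv_feat = audio_encoder(waveform + delta)
        # Untargeted objective: minimize similarity to the clean features,
        # i.e. push the audio representation away from where it started.
        loss = F.cosine_similarity(adv_feat.flatten(1), clean_feat.flatten(1)).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()            # descend on similarity
            delta.clamp_(-eps, eps)                       # L_inf budget keeps distortion small
            delta.grad.zero_()
    return (waveform + delta).detach()
```

In this framing, the perceptual constraints reported in the abstract (LPIPS, SI-SNR) act as post-hoc quality checks on the perturbed audio, while the choice of which internal quantity the loss targets (encoder features here, cross-modal attention or output likelihoods in other variants) determines which stage of multimodal processing the attack disrupts.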