Multimodal large language models (MLLMs) have achieved remarkable progress, yet they remain critically vulnerable to adversarial attacks that exploit weaknesses in cross-modal processing. We present a systematic study of multimodal jailbreaks targeting both vision-language and audio-language models, showing that even simple perceptual transformations can reliably bypass state-of-the-art safety filters. Our evaluation spans 1,900 adversarial prompts across three high-risk safety categories: harmful content, CBRN (Chemical, Biological, Radiological, Nuclear), and CSEM (Child Sexual Exploitation Material), tested against seven frontier models. We evaluate the effectiveness of several attack techniques on MLLMs, including FigStep-Pro (visual keyword decomposition), Intelligent Masking (semantic obfuscation), and audio perturbations (Wave-Echo, Wave-Pitch, Wave-Speed). The results reveal severe vulnerabilities: models with near-perfect text-only safety (0\% ASR) suffer attack success rates above 75\% under perceptually modified inputs, with FigStep-Pro achieving up to 89\% ASR on Llama-4 variants. Audio-based attacks further uncover provider-specific weaknesses, with even basic modality transfer yielding 25\% ASR on technical queries. These findings expose a critical gap between text-centric alignment and multimodal threats, demonstrating that current safeguards fail to generalize to cross-modal attacks. The accessibility of these attacks, which require minimal technical expertise, suggests that robust multimodal AI safety will require a paradigm shift toward broader semantic-level reasoning to mitigate these risks.
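To illustrate how little technical effort such perceptual transformations demand, the sketch below applies pitch, speed, and echo perturbations to a spoken prompt using standard audio tooling. The function names, the input filename, and all parameter values (semitone shift, stretch rate, echo delay and decay) are illustrative assumptions for this sketch, not the exact Wave-Pitch, Wave-Speed, or Wave-Echo configurations used in our experiments.

\begin{verbatim}
import numpy as np
import librosa
import soundfile as sf

def wave_pitch(y, sr, n_steps=2):
    # Shift pitch by n_steps semitones without changing duration.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def wave_speed(y, rate=1.25):
    # Speed up (rate > 1) or slow down (rate < 1) without changing pitch.
    return librosa.effects.time_stretch(y, rate=rate)

def wave_echo(y, sr, delay_s=0.25, decay=0.4):
    # Mix in a delayed, attenuated copy of the signal, then renormalize.
    d = int(delay_s * sr)
    out = np.copy(y)
    out[d:] += decay * y[: len(y) - d]
    return out / max(1.0, float(np.max(np.abs(out))))

if __name__ == "__main__":
    # "spoken_prompt.wav" is a hypothetical recording of the adversarial prompt.
    y, sr = librosa.load("spoken_prompt.wav", sr=None)
    sf.write("prompt_pitch.wav", wave_pitch(y, sr), sr)
    sf.write("prompt_speed.wav", wave_speed(y), sr)
    sf.write("prompt_echo.wav", wave_echo(y, sr), sr)
\end{verbatim}

Each output file preserves the semantic content of the original query while altering its acoustic surface form, which is the property the audio perturbation attacks rely on.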