Cross-Modal Backdoors in Multimodal Large Language Models

Developers increasingly construct multimodal large language models (MLLMs) by assembling pretrained components, introducing supply-chain attack surfaces. Existing security research primarily focuses on poisoning backbones such as encoders or large language models (LLMs), while the security risks of lightweight connectors remain unexplored. In this work, we propose a novel cross-modal backdoor attack that exploits this overlooked vulnerability. By poisoning only the connector using a single seed sample and several augmented variants from one modality, the adversary can subsequently activate the backdoor using inputs from other modalities. To achieve this, we first poison the connector to associate a compact latent region with a malicious target output. To activate the backdoor from other modalities, we further extract a malicious centroid from the poisoned latent representations and perform input-side optimization to steer inputs toward this latent anchor, without requiring repeated API queries or full-model access. Extensive evaluations on representative connector-based MLLM architectures, including PandaGPT and NExT-GPT, demonstrate both the effectiveness and cross-modal transferability of the proposed attack. The attack achieves up to 99.9% attack success rate (ASR) in same-modality settings, while most cross-modal settings exceed 95.0% ASR under bounded perturbations. Moreover, the attack remains highly stealthy, producing negligible leakage on clean inputs and maintaining weight-cosine similarity above 0.97 relative to benign connectors. We further show that existing defense strategies fail to effectively mitigate this threat without incurring substantial utility degradation. These findings reveal a fundamental vulnerability in multimodal alignment: a single compromised connector can establish a reusable latent-space backdoor pathway across modalities, highlighting the need for safer modular MLLM design.
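
The activation step described above (extract a centroid from the poisoned latent representations, then optimize an input from another modality toward that anchor under a bounded perturbation) can be illustrated with a small toy. The following is a minimal sketch under stated assumptions, not the attack's actual implementation: a fixed linear map `W` stands in for a frozen encoder plus connector, the loss is a squared L2 distance to the centroid, and the optimizer is signed-gradient descent projected onto an L-infinity ball of radius `eps`. The abstract does not specify the loss, optimizer, or norm, and all function and parameter names here are hypothetical.

```python
# Toy illustration only: W, the loss, the optimizer, and all names are
# assumptions made for this sketch; none are taken from the source.

def matvec(W, x):
    """Apply the toy linear 'encoder + connector' W to input x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def centroid(latents):
    """Mean of a list of latent vectors (the 'malicious centroid')."""
    n = len(latents)
    return [sum(col) / n for col in zip(*latents)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def sign(v):
    return (v > 0) - (v < 0)

def steer_input(W, x, anchor, eps=0.1, step=0.01, iters=200):
    """Find delta with max|delta_j| <= eps that moves W(x + delta) toward anchor.

    Uses signed-gradient steps on ||W(x + delta) - anchor||^2, clipping
    delta back into [-eps, eps] after every step.
    """
    delta = [0.0] * len(x)
    for _ in range(iters):
        z = matvec(W, [xi + di for xi, di in zip(x, delta)])
        resid = [zi - ai for zi, ai in zip(z, anchor)]
        # Gradient of the squared distance w.r.t. delta is 2 * W^T * resid.
        grad = [2 * sum(W[r][j] * resid[r] for r in range(len(W)))
                for j in range(len(x))]
        delta = [max(-eps, min(eps, di - step * sign(g)))
                 for di, g in zip(delta, grad)]
    return delta

# Latents of the seed sample and its augmented variants (made-up numbers).
poisoned_latents = [[2.0, -1.0], [2.2, -0.8], [1.8, -1.2]]
anchor = centroid(poisoned_latents)

W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]   # toy map: 3-dim input -> 2-dim latent
x = [0.0, 0.0, 0.0]                        # clean input from another modality

delta = steer_input(W, x, anchor, eps=0.1)
before = sq_dist(matvec(W, x), anchor)
after = sq_dist(matvec(W, [xi + di for xi, di in zip(x, delta)]), anchor)
print(before, after)
```

In this toy the perturbation budget is too small to reach the anchor, so the distance only shrinks; the sketch shows the mechanics of bounded input-side steering and says nothing about the success rates reported above.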
