Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. Existing methods often fail under strong detection or continuous fine-tuning, largely due to (1) cross-modal inconsistency that exposes trigger patterns and (2) gradient dilution at low poisoning rates that accelerates backdoor forgetting. These coupled causes remain insufficiently modeled and addressed. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions, preserving clean-data statistics while producing compact trigger distributions. We further apply target-aligned subset selection to strengthen signals at low injection rates. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment, and stabilize model parameters through curvature control and elastic weight consolidation, keeping solutions within a wide, low-curvature basin that resists fine-tuning. We also provide the first theoretical analysis showing that, within a trust region, gradients from clean fine-tuning and backdoor objectives are co-directional, yielding a non-increasing upper bound on attack success degradation. Experiments demonstrate that with only 0.3% poisoning, BadCLIP++ achieves a 99.99% attack success rate (ASR) in digital settings, surpassing baselines by 11.4 percentage points. Across nineteen defenses, ASR remains above 99.90% with a drop of less than 0.8% in clean accuracy. The method further attains a 65.03% success rate in physical attacks and shows robustness against watermark removal defenses.
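To make the persistence terms named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. The abstract gives no formulas, so the code assumes the standard diagonal-Fisher form of elastic weight consolidation and a hypothetical embedding term that measures the mean squared radius of trigger embeddings around their centroid and the squared distance from that centroid to a target centroid. The function names, the weight `lam`, and both loss forms are illustrative assumptions.

```python
import numpy as np


def ewc_penalty(theta, theta_ref, fisher_diag, lam=1.0):
    """Standard diagonal EWC penalty (assumed form, not taken from the paper):
    (lam / 2) * sum_i F_i * (theta_i - theta_ref_i)^2
    """
    theta = np.asarray(theta, dtype=float)
    theta_ref = np.asarray(theta_ref, dtype=float)
    fisher_diag = np.asarray(fisher_diag, dtype=float)
    diff = theta - theta_ref
    return 0.5 * lam * float(np.sum(fisher_diag * diff ** 2))


def compactness_terms(trigger_emb, target_centroid):
    """Hypothetical embedding terms for 'radius shrinkage' and 'centroid alignment'.

    Returns (radius, align):
      radius: mean squared distance of each trigger embedding to the batch centroid
      align:  squared distance between the batch centroid and the target centroid
    """
    emb = np.asarray(trigger_emb, dtype=float)
    target = np.asarray(target_centroid, dtype=float)
    centroid = emb.mean(axis=0)
    radius = float(np.mean(np.sum((emb - centroid) ** 2, axis=1)))
    align = float(np.sum((centroid - target) ** 2))
    return radius, align


if __name__ == "__main__":
    # Toy parameters: two weights, the second deemed less important by the Fisher estimate.
    print(ewc_penalty([1.0, 2.0], [0.0, 0.0], [1.0, 0.5], lam=2.0))
    # Toy embeddings: two 2-D trigger embeddings and a target centroid.
    print(compactness_terms([[0.0, 0.0], [2.0, 0.0]], [1.0, 1.0]))
```

In a training loop, these scalars would be added to the main objective with tunable weights. Minimizing the radius term tightens the trigger distribution, and the EWC term discourages movement along parameter directions that the Fisher estimate marks as important.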