Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable to backdoor attacks during fine-tuning, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context, a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.
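The two-step defense described above can be illustrated with a minimal sketch. The code below assumes access to a head-averaged cross-modal attention matrix from one selected fusion layer; the function names, threshold values, and pruning fraction are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the detect-and-prune idea, assuming a head-averaged attention
# matrix `attn` of shape [num_queries, num_keys] from a selected fusion layer.
# Thresholds and the pruning fraction are hypothetical placeholders.
import torch

def visual_text_attention_ratio(attn, visual_idx, text_idx):
    """Per-query ratio of attention mass on visual keys vs. text keys."""
    vis_mass = attn[:, visual_idx].sum(dim=-1)
    txt_mass = attn[:, text_idx].sum(dim=-1)
    return vis_mass / (txt_mass + 1e-8)

def detect_and_prune(attn, visual_idx, text_idx, detect_thresh=3.0, prune_frac=0.05):
    """(i) Flag the input if the mean visual/text ratio is abnormally high,
    then (ii) drop the visual tokens that attract the most attention."""
    ratio = visual_text_attention_ratio(attn, visual_idx, text_idx)
    is_poisoned = ratio.mean().item() > detect_thresh
    keep = torch.ones(len(visual_idx), dtype=torch.bool)
    if is_poisoned:
        per_token_mass = attn[:, visual_idx].mean(dim=0)   # attention each visual token attracts
        num_prune = max(1, int(prune_frac * len(visual_idx)))
        prune = per_token_mass.topk(num_prune).indices
        keep[prune] = False
    return is_poisoned, keep  # boolean mask over visual tokens to keep

# Toy usage: random attention weights, 12 visual keys followed by 8 text keys.
attn = torch.softmax(torch.randn(8, 20), dim=-1)
vis_idx, txt_idx = torch.arange(0, 12), torch.arange(12, 20)
flagged, keep_mask = detect_and_prune(attn, vis_idx, txt_idx)
```

In this sketch, purification is only applied when detection fires, so clean inputs pass through unchanged, which is consistent with the goal of preserving utility on clean samples.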