Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while preserving overall semantic representations. Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This design avoids large-batch contrastive learning and additional momentum encoders, enabling robust training with low computational overhead. We evaluate Sim-CLIP across multiple Vision-Language Models and tasks under both targeted and untargeted adversarial attacks. Experimental results demonstrate that Sim-CLIP consistently outperforms state-of-the-art robust CLIP variants, achieving stronger adversarial robustness while maintaining or improving semantic fidelity. These findings highlight the limitations of existing adversarial defenses and establish Sim-CLIP as an effective and scalable solution for robust vision-language representation learning.
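The Siamese objective described in the abstract can be illustrated with a small example. What follows is a minimal sketch in plain Python, not the authors' implementation. It assumes the loss is a symmetric negative cosine similarity between a clean embedding and an adversarial embedding, where each term treats the other branch as a constant (the stop-gradient). The function names (`cosine`, `symmetric_loss`, `grad_wrt_first`) and the toy 2-D vectors are hypothetical. The CLIP encoder, any projection or prediction heads, and the attack that generates the adversarial view are omitted because the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def symmetric_loss(clean, adv):
    """Symmetric negative cosine similarity (assumed form of the objective).

    In an autograd framework, the second argument of each term would be
    wrapped in a stop-gradient. That changes only the backward pass, so the
    forward value computed here is unaffected.
    """
    term_adv = -cosine(adv, clean)    # gradient would flow into adv only
    term_clean = -cosine(clean, adv)  # gradient would flow into clean only
    return 0.5 * term_adv + 0.5 * term_clean


def grad_wrt_first(u, v):
    """Gradient of -cosine(u, v) with respect to u, holding v constant.

    Holding v constant is what the stop-gradient means for a single term.
    """
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    c = cosine(u, v)
    # d cos / du = v / (|u||v|) - cos(u, v) * u / |u|^2
    return [-(b / (norm_u * norm_v) - c * a / (norm_u ** 2))
            for a, b in zip(u, v)]


if __name__ == "__main__":
    clean = [1.0, 0.0]  # hypothetical embedding of a clean image
    adv = [0.6, 0.8]    # hypothetical embedding of its adversarial view
    print("loss before:", symmetric_loss(clean, adv))

    # One small gradient-descent step on the adversarial branch only,
    # with the clean branch held fixed.
    step = 0.1
    g = grad_wrt_first(adv, clean)
    adv = [a - step * gi for a, gi in zip(adv, g)]
    print("loss after: ", symmetric_loss(clean, adv))
```

Because cosine similarity is invariant to the scale of its inputs, the gradient in `grad_wrt_first` is orthogonal to the embedding being updated, so a step rotates the adversarial embedding toward the clean one rather than rescaling it. The loss needs only the two views of the same image, which is consistent with the abstract's claim that no large batches of negatives or momentum encoders are required.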