This paper examines the robustness of CLIP (Contrastive Language-Image Pretraining), a multi-modal computer vision model, in the context of unsupervised learning. The objective is twofold: first, to evaluate the robustness of CLIP, and second, to explore strategies for improving it. To this end, we introduce LP-CLIP, a technique that distills CLIP features by adding a linear probing layer on top of the CLIP encoder. This layer is trained on pseudo-labels produced by CLIP, combined with a self-training strategy. With only a simple linear probing layer, LP-CLIP aims to improve the model's ability to withstand the uncertainties and challenges commonly encountered in real-world scenarios. Because the approach requires no annotations, it is particularly valuable where labeled data is scarce or costly to obtain. On various datasets, LP-CLIP increases the robustness of CLIP and achieves state-of-the-art results compared to supervised techniques.
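The pseudo-labeling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the features are synthetic stand-ins for frozen CLIP image and text embeddings, all function names and hyperparameters (`zero_shot_pseudo_labels`, `train_linear_probe`, the learning rate, the noise level) are hypothetical, and the self-training strategy is omitted because the abstract does not specify it. The sketch only shows how a linear layer can be fit to labels that the frozen model itself produces, with no human annotations.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def zero_shot_pseudo_labels(image_feats, text_feats):
    """Pseudo-label = index of the class text embedding with highest cosine similarity."""
    labels = []
    for f in image_feats:
        sims = [dot(f, t) for t in text_feats]
        labels.append(sims.index(max(sims)))
    return labels


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_linear_probe(feats, pseudo_labels, num_classes, epochs=300, lr=0.5):
    """Fit a linear layer by full-batch gradient descent on cross-entropy
    against the pseudo-labels. The features themselves are never updated."""
    dim, n = len(feats[0]), len(feats)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for f, y in zip(feats, pseudo_labels):
            probs = softmax([dot(W[c], f) + b[c] for c in range(num_classes)])
            for c in range(num_classes):
                err = (probs[c] - (1.0 if c == y else 0.0)) / n
                grad_b[c] += err
                for j in range(dim):
                    grad_W[c][j] += err * f[j]
        for c in range(num_classes):
            b[c] -= lr * grad_b[c]
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j]
    return W, b


def predict(W, b, f):
    logits = [dot(W[c], f) + b[c] for c in range(len(W))]
    return logits.index(max(logits))


# Synthetic stand-ins for frozen encoder outputs (assumption: NOT real CLIP features).
random.seed(0)
NUM_CLASSES, DIM, PER_CLASS = 3, 8, 20
text_feats = [normalize([random.gauss(0, 1) for _ in range(DIM)])
              for _ in range(NUM_CLASSES)]
image_feats = [normalize([t_j + random.gauss(0, 0.15) for t_j in text_feats[c]])
               for c in range(NUM_CLASSES) for _ in range(PER_CLASS)]

pseudo = zero_shot_pseudo_labels(image_feats, text_feats)   # labels from the "teacher"
W, b = train_linear_probe(image_feats, pseudo, NUM_CLASSES)  # linear probe "student"
agreement = sum(predict(W, b, f) == y
                for f, y in zip(image_feats, pseudo)) / len(pseudo)
print(f"probe/pseudo-label agreement: {agreement:.2f}")
```

In a real setting, `image_feats` and `text_feats` would presumably come from the frozen CLIP image and text encoders, and the agreement score would be replaced by the robustness metrics the paper evaluates.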