Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work provides a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate robustness to variations in specific visual factors. Second, beyond mere classification accuracy, we assess two critical safety objectives: confidence uncertainty and out-of-distribution detection. Third, we evaluate how finely CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between the vision and language encoders within modern large multimodal models (LMMs) that use CLIP as the visual backbone, focusing on how this interaction affects classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder plays a significant role in robustness to 3D corruptions. CLIP models tend to exhibit a shape bias in their predictions, and this bias tends to diminish after fine-tuning on ImageNet. Vision-language models such as LLaVA, which leverage the CLIP vision encoder, can outperform CLIP alone on challenging categories. Our findings are poised to offer valuable guidance for enhancing the robustness and reliability of CLIP models.
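To make the evaluated setting concrete, the following is a minimal sketch of CLIP-style zero-shot classification: class names are wrapped in test-time prompts, both modalities are embedded, and the prediction is the prompt with the highest temperature-scaled cosine similarity. The embeddings here are random stand-ins for real encoder outputs, and the function name and temperature value are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """Score one image embedding against per-class text-prompt embeddings.

    Mirrors the CLIP zero-shot recipe: L2-normalize both sides, take
    cosine similarities scaled by a temperature, then softmax over classes.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)          # cosine similarity per class
    probs = np.exp(logits - logits.max())       # numerically stable softmax
    return probs / probs.sum()

# Toy 512-d embeddings standing in for encoder outputs of an image and of
# prompts like "a photo of a cat/dog/car" (hypothetical class names).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(3, 512))
probs = zero_shot_classify(image_emb, text_embs)
print(probs)  # a distribution over the three candidate prompts
```

The softmax confidence produced here is exactly the quantity whose calibration the "confidence uncertainty" analysis probes, and varying the prompt template changes `text_embs`, which is how test-time prompts enter as an evaluation factor.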