A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)

Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable generalization across multiple challenging distribution shifts. However, much remains to be explored regarding their robustness to variations in specific visual factors. In real-world applications, reliable and safe systems must meet safety objectives beyond classification accuracy, such as predictive uncertainty, yet the effectiveness of CLIP models on such safety-related properties has received little attention. Motivated by these gaps, this work comprehensively investigates the safety objectives of CLIP models, focusing on three key properties: resilience to visual factor variations, calibrated uncertainty estimation, and the ability to detect anomalous inputs. To this end, we study 83 CLIP models and 127 ImageNet classifiers that are diverse in architecture, (pre)training distribution, and training strategy. We consider 10 visual factors (e.g., shape and pattern), 5 types of out-of-distribution data, and 8 natural and challenging test conditions with different shift types, such as texture, style, and perturbation shifts. Our study reveals several previously unknown insights into CLIP models. For instance, they are not consistently better calibrated than other ImageNet models, which contradicts existing findings. Additionally, our analysis underscores the importance of training source design by showing its profound influence on all three safety-related properties. We believe our comprehensive study can shed light on and help guide the development of more robust and reliable CLIP models.
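
To make the notion of calibrated uncertainty concrete, the following is a minimal sketch of expected calibration error (ECE), one common way to quantify calibration. The abstract does not state which calibration metric the study uses, so ECE is an assumption here, and the function name, equal-width binning scheme, and default of 10 bins are illustrative choices rather than the authors' protocol.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE (hypothetical helper, not taken from the paper).

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: whether each corresponding prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins

    for c, ok in zip(confidences, correct):
        # Bin k covers [k / n_bins, (k + 1) / n_bins); a confidence of
        # exactly 1.0 is placed in the last bin.
        k = min(int(c * n_bins), n_bins - 1)
        conf_sum[k] += c
        hit_sum[k] += 1.0 if ok else 0.0
        count[k] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[k] / count[k]
        accuracy = hit_sum[k] / count[k]
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (count[k] / n) * abs(accuracy - avg_conf)
    return ece
```

A model that predicts with 0.9 confidence but is right only 75% of the time yields a nonzero gap under this measure, while a perfectly calibrated model scores 0. Lower values indicate better calibration.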
