Do CLIPs Always Generalize Better than ImageNet Models?

Large vision-language models, such as CLIPs, have revolutionized modern machine learning. A growing body of literature reports that CLIPs generalize well under distribution shifts. However, the datasets used to evaluate CLIPs are mostly variants designed around ImageNet benchmarks, which may not fully reflect the extent to which CLIPs, e.g., those pre-trained on LAION, are robust to spurious correlations. To bridge the gap, we collect a real-world dataset called CounterAnimal that contains realistic spurious features found in animal photos. CounterAnimal consists of a) the common group, comprising animals on common backgrounds, and b) the counter group, comprising animals on unusual backgrounds. The performance drop from the common group to the counter group quantifies how much a model relies on spurious features (i.e., backgrounds) to predict the animals. We find that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group. Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs. We provide both theoretical and empirical explanations for why CLIPs still learn spurious features. Our findings suggest that distribution shifts remain an open problem for CLIPs, and that one needs to be cautious about test setups when evaluating foundation models pre-trained at a significantly different scale and on a significantly different distribution.
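The common-to-counter performance drop described above can be illustrated with a minimal sketch. It assumes the drop is simply top-1 accuracy on the common group minus top-1 accuracy on the counter group; the function names `accuracy` and `performance_drop` and the toy predictions are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty group")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def performance_drop(
    common_preds: Sequence[str],
    common_labels: Sequence[str],
    counter_preds: Sequence[str],
    counter_labels: Sequence[str],
) -> float:
    """Accuracy on the common group minus accuracy on the counter group.

    Assumption: a larger positive value is read as stronger reliance
    on the background (the spurious feature).
    """
    return accuracy(common_preds, common_labels) - accuracy(
        counter_preds, counter_labels
    )


# Toy, made-up predictions for one animal class (not real results).
common_labels = ["polar bear"] * 4
common_preds = ["polar bear", "polar bear", "polar bear", "brown bear"]
counter_labels = ["polar bear"] * 4
counter_preds = ["polar bear", "brown bear", "brown bear", "polar bear"]

drop = performance_drop(common_preds, common_labels, counter_preds, counter_labels)
print(f"common-to-counter accuracy drop: {drop:.2f}")
```

In this toy case the model is right on 3 of 4 common-background photos and 2 of 4 counter-background photos, so the drop is 0.25. Comparing this number across models is only meaningful when both groups contain the same classes.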

