Large vision-language models such as CLIPs have revolutionized modern machine learning. A growing body of literature reports that CLIPs generalize well under distribution shifts. However, the evaluation datasets for CLIPs are primarily variations designed for ImageNet benchmarks, which may not fully reflect the extent to which CLIPs, e.g., those pre-trained on LAION, are robust to spurious correlations. To bridge the gap, we collect a real-world dataset called CounterAnimal that contains realistic spurious features found in animal photos. CounterAnimal consists of a) the common group, comprising animals on common backgrounds, and b) the counter group, comprising animals on unusual backgrounds. The performance drop from the common group to the counter group quantifies the reliance of models on spurious features (i.e., backgrounds) to predict the animals. We find that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group. Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs. We provide both theoretical and empirical explanations for why CLIPs still learn spurious features. Our findings suggest that distribution shifts remain an open problem for CLIPs, and that one needs to be cautious about test setups when evaluating foundation models pre-trained at a significantly different scale and on a significantly different distribution.
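The common-to-counter performance drop described in the abstract can be illustrated with a minimal sketch. It assumes the metric is top-1 accuracy and that predictions and labels are available as plain lists of class names; the function names `accuracy` and `performance_drop` and the toy data are hypothetical and are not taken from the CounterAnimal release.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the labels (assumed top-1 accuracy)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty group")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def performance_drop(
    common_preds: Sequence[str],
    common_labels: Sequence[str],
    counter_preds: Sequence[str],
    counter_labels: Sequence[str],
) -> float:
    """Accuracy on the common group minus accuracy on the counter group.

    A larger positive value suggests heavier reliance on the background.
    """
    return accuracy(common_preds, common_labels) - accuracy(counter_preds, counter_labels)


# Toy, made-up data: one animal class photographed on common vs. unusual backgrounds.
common_labels = ["polar bear"] * 4
common_preds = ["polar bear"] * 4  # all correct on common backgrounds
counter_labels = ["polar bear"] * 4
counter_preds = ["polar bear", "brown bear", "polar bear", "brown bear"]  # half correct

drop = performance_drop(common_preds, common_labels, counter_preds, counter_labels)
print(f"performance drop: {drop:.2f}")
```

Because both groups contain the same animal classes and differ only in background, a drop near zero would suggest the model predicts from the animal itself, whereas a large drop points to the background acting as a shortcut.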