Vision-language models such as CLIP have shown promising out-of-distribution (OoD) generalization under various types of distribution shift. Recent studies have attempted to identify the leading cause of this capability. In this work, we follow the same path, but focus on a specific type of OoD data, namely images with novel compositions of attribute-object pairs, and study whether such models can successfully classify those images into the corresponding composition classes. We carefully designed ImageNet-AO, a test dataset of authentic images whose attribute-object combinations are unlikely to be encountered in CLIP training sets. We found that CLIP models trained on large datasets (the OpenAI CLIP training set, LAION-400M, and LAION-2B) show an orders-of-magnitude improvement in effective compositional OoD generalization over both supervised models and CLIP models trained on smaller datasets such as CC-12M and YFCC-15M. Our results provide evidence that the scale and diversity of training data, together with language supervision, play a key role in unlocking the compositional generalization abilities of vision-language models.
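The evaluation implied here is zero-shot classification: each candidate attribute-object composition is rendered as a text prompt, and an image is assigned to the composition whose text embedding it matches best. Below is a minimal sketch of this protocol using the open_clip library; the model tag, prompt template, and composition names are illustrative assumptions, not the paper's exact setup.

```python
import torch
import open_clip

# Illustrative sketch: zero-shot classification over attribute-object
# composition classes. Model tag and pretrained weights are assumptions.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Hypothetical composition classes; the real ImageNet-AO label set differs.
compositions = ["wooden car", "furry apple", "metal dog"]
prompts = [f"a photo of a {c}" for c in compositions]

with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features /= text_features.norm(dim=-1, keepdim=True)

def classify(image):
    """Return the index of the best-matching composition class
    for a PIL image, by cosine similarity in CLIP embedding space."""
    with torch.no_grad():
        image_features = model.encode_image(preprocess(image).unsqueeze(0))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = image_features @ text_features.T
    return logits.argmax(dim=-1).item()
```

Accuracy on compositions absent from the training distribution is then a direct measure of compositional OoD generalization.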