CLIP models have recently been shown to exhibit Out of Distribution (OoD) generalization capabilities. However, Compositional Out of Distribution (C-OoD) generalization, the ability of a model to understand unseen compositions of known concepts, remains relatively unexplored for CLIP models. Our goal is to address this gap and identify the factors that contribute to C-OoD generalization in CLIP models. We note that previous studies of the compositional understanding of CLIP models frequently fail to ensure that test samples are genuinely novel relative to the CLIP training data. To this end, we carefully synthesized a large and diverse dataset in the single-object setting, comprising objects with attributes that are highly unlikely to be encountered in the combined training datasets of various CLIP models. This dataset enables an authentic evaluation of C-OoD generalization. Our observations reveal varying levels of C-OoD generalization across different CLIP models. We propose that the disentanglement of CLIP representations serves as a critical indicator of C-OoD generalization. Using our synthesized dataset and other existing datasets, we assess various disentanglement metrics of text and image representations. Our study reveals that the disentanglement of image and text representations, particularly with respect to their compositional elements, plays a crucial role in improving the generalization of CLIP models in out-of-distribution settings. This finding suggests promising opportunities for advancing out-of-distribution generalization in CLIP models.