Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance in matching images and text. However, adapting vision-language pretrained models like CLIP to compositional image and text matching remains difficult: this more demanding task requires the model to understand compositional word concepts and visual components. To improve compositional generalization in zero-shot image and text matching, we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause matching failures. We therefore propose a novel \textbf{\textit{training-free}} compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subject, object, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embeddings and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets (SVO, ComVG, Winoground, and VL-checklist) and two general image-text retrieval datasets (Flickr30K and MSCOCO) demonstrate the effectiveness of our plug-and-play method, which boosts the \textbf{\textit{zero-shot}} inference ability of CLIP, SLIP, and BLIP2 without further training or fine-tuning. Our code is available at https://github.com/eric-ai-lab/ComCLIP.
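As a rough illustration of the matching idea described in the abstract, the sketch below fuses sub-image embeddings into a global image embedding using weights derived from how well each sub-image matches its corresponding word, and then scores the fused embedding against the sentence embedding. This is a minimal sketch under stated assumptions, not the released ComCLIP implementation: the function name `compositional_score`, the softmax weighting, the additive fusion, and the random vectors standing in for CLIP encoder outputs are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def compositional_score(image_emb, sub_image_embs, sentence_emb, word_embs):
    """Hypothetical compositional matching score.

    image_emb:      (d,)   embedding of the full image
    sub_image_embs: (k, d) embeddings of subject/object/action sub-images
    sentence_emb:   (d,)   embedding of the full sentence
    word_embs:      (k, d) embeddings of the matching subject/object/action words
    """
    image_emb = l2_normalize(image_emb)
    sentence_emb = l2_normalize(sentence_emb)
    sub = l2_normalize(sub_image_embs)
    words = l2_normalize(word_embs)

    # Cosine similarity between each sub-image and its corresponding word.
    sims = np.sum(sub * words, axis=1)

    # Assumption: component importance is a softmax over those similarities.
    weights = softmax(sims)

    # Assumption: weighted sub-image embeddings are added to the global one.
    fused = l2_normalize(image_emb + weights @ sub)

    return float(fused @ sentence_emb), weights


# Random vectors stand in for real encoder outputs (illustrative only).
rng = np.random.default_rng(0)
dim = 16
score, weights = compositional_score(
    rng.normal(size=dim),
    rng.normal(size=(3, dim)),
    rng.normal(size=dim),
    rng.normal(size=(3, dim)),
)
print("score:", score)
print("component weights:", weights)
```

In a real setting the four inputs would come from a pretrained vision encoder and text encoder, and the sub-images from a separate segmentation or detection step. No weights are updated anywhere in the sketch, which is what makes a scheme of this kind training-free.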