No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but they are often specific to a single benchmark, do not generalize, and can substantially degrade basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit the compositionality performance of V&L models: 1) long training captions do not require a compositional representation; and 2) the final global pooling in the text and image encoders leads to a complete loss of the information necessary to learn binding in the first place. As a remedy, we propose two simple solutions: 1) we obtain short, concept-centric caption parts using standard NLP software and align those with the image; and 2) we introduce a parameter-free cross-modal attention pooling to obtain concept-centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain state-of-the-art (SOTA) performance on standard compositionality benchmarks while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.
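
To make the second idea concrete, here is a minimal NumPy sketch of what a parameter-free cross-modal attention pooling could look like. It is an illustration under stated assumptions, not the released implementation: the function name `concept_attention_pool`, the cosine-similarity scoring, and the `temperature` value are hypothetical choices made for this example. Each concept text embedding acts as a query over the image patch tokens, and the pooled visual embedding is the attention-weighted sum of those tokens, so no learned projection matrices are involved.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def concept_attention_pool(patch_tokens, concept_emb, temperature=0.1):
    """Pool image patch tokens once per concept, with no learned weights.

    patch_tokens: (P, D) array of patch embeddings from the image encoder.
    concept_emb:  (C, D) array of text embeddings of short caption parts.
    temperature:  assumed softmax temperature (hypothetical value).
    Returns (pooled, weights) with shapes (C, D) and (C, P).
    """
    patches = l2_normalize(patch_tokens)
    queries = l2_normalize(concept_emb)
    # Cosine similarity between every concept and every patch.
    scores = queries @ patches.T / temperature
    # Attention weights over patches; each row sums to 1.
    weights = softmax(scores, axis=-1)
    # Concept-centric visual embedding: weighted sum of the raw patch tokens.
    pooled = weights @ patch_tokens
    return pooled, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch_tokens = rng.normal(size=(16, 8))   # 16 patches, 8-dim toy features
    concept_emb = rng.normal(size=(3, 8))     # 3 concept phrases
    pooled, weights = concept_attention_pool(patch_tokens, concept_emb)
    print(pooled.shape, weights.shape)
```

In a training setup of this kind, the pooled embeddings would be paired with their concept text embeddings in an auxiliary contrastive loss. Because the pooling needs the text as a query, it would only be used during training, which is consistent with the claim that inference cost is unchanged.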
