Vision-Language Models (VLMs) such as CLIP exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks, including zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. This limitation stems from the inadequate alignment between images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components such as relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method makes better use of available datasets by refining and expanding the standard image-text contrastive learning framework. It requires no specific annotations and adds no extra parameters. When integrated with CLIP, our technique yields notable improvements over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.
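For reference, the standard image-text contrastive objective that the abstract builds on can be sketched as follows. This is the generic symmetric cross-entropy (InfoNCE-style) loss used by CLIP-like models, not the refined objective proposed in this work. The function names, the toy embeddings, and the fixed temperature of 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    image/caption in the batch serves as a negative.
    The temperature value is an assumed, fixed illustration.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and caption j, scaled
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Toy batch of two image embeddings (hypothetical values).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_contrastive_loss(image_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss(image_embs, [[0.0, 1.0], [1.0, 0.0]])
uniform = clip_contrastive_loss(image_embs, [[1.0, 1.0], [1.0, 1.0]])
```

In this toy batch the loss is near zero when each caption matches its image, large when the captions are swapped, and equal to log 2 when the captions are indistinguishable. Because the loss only asks each caption to beat the other captions in the batch, nothing in it explicitly rewards distinguishing captions that share the same words in a different arrangement, which is the gap the abstract describes.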