Current Vision and Language Models (VLMs) demonstrate strong performance across a range of vision-language tasks, yet they struggle with fine-grained understanding. This weakness stems from weak image-caption alignment in pretraining datasets and from a simplified contrastive objective that fails to distinguish nuanced grounding elements such as relations, actions, and attributes. As a result, the models tend to learn bag-of-words representations. To mitigate these problems, we introduce an intra-modal contrastive loss and a cross-modal rank loss whose adaptive threshold serves as a form of curriculum learning; both losses use our automatically generated hard negatives to strengthen the model. Our strategy requires no additional annotations or parameters and can be incorporated into any VLM trained with an image-text contrastive loss. Applied to CLIP, our method yields significant improvements on four fine-grained benchmarks, and it also improves the performance of X-VLM, the state-of-the-art model on fine-grained reasoning.
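To make the two losses named in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not give the formulas, so everything here is an assumption chosen for illustration: the hinge form of the rank loss, the softmax form of the intra-modal loss, the running-mean threshold rule with its `upper_bound`, the temperature value, the function names, and the toy two-dimensional embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_modal_rank_loss(image, caption, hard_negative, threshold):
    """Assumed hinge form: zero once the true caption beats the hard negative by `threshold`."""
    margin = cosine(image, caption) - cosine(image, hard_negative)
    return max(0.0, threshold - margin)

def intra_modal_contrastive_loss(image, caption, hard_negatives, temperature=0.07):
    """Assumed softmax form: penalizes hard-negative captions that lie close to the true caption."""
    pos = math.exp(cosine(image, caption) / temperature)
    neg = sum(math.exp(cosine(caption, hn) / temperature) for hn in hard_negatives)
    return -math.log(pos / (pos + neg))

def adaptive_threshold(margins, upper_bound=0.2):
    """Assumed curriculum rule: mean margin observed so far, clipped to [0, upper_bound]."""
    mean_margin = sum(margins) / len(margins)
    return min(max(mean_margin, 0.0), upper_bound)

# Toy embeddings (hypothetical values, for illustration only).
image = [1.0, 0.0]
caption = [0.9, 0.1]        # e.g. "a dog chasing a cat"
hard_negative = [0.8, 0.3]  # e.g. "a cat chasing a dog"

margin = cosine(image, caption) - cosine(image, hard_negative)
loose = cross_modal_rank_loss(image, caption, hard_negative, threshold=0.0)
strict = cross_modal_rank_loss(image, caption, hard_negative, threshold=0.2)
```

In this sketch, a threshold of zero only asks the true caption to outrank the hard negative, while a larger threshold demands a wider gap. Raising the threshold as observed margins grow is one way such a schedule could act as a curriculum.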