Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, which has made them ubiquitous in image and text understanding tasks. They are typically trained contrastively on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks, which require a fine-grained understanding of the complex interactions between objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets, but the mined negatives might not be difficult for the model to discriminate from the positive. An alternative to mining is to generate negative samples. 2) Existing generative approaches, however, primarily focus on generating hard negative texts for a given image. The other direction, i.e., generating negative image samples for a given text, has been ignored. To overcome both limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generated hard negative samples, we significantly enhance VLMs' performance on tasks involving multimodal compositional reasoning. Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
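To make the idea of hard negatives in both modalities concrete, the following is a minimal sketch of a contrastive objective in which an image is scored against its caption plus negative captions, and the caption is scored against its image plus negative images. It is not the released implementation: the InfoNCE-style loss, the equal weighting of the two directions, the temperature value, and all function names are assumptions made for illustration, and the embeddings are toy vectors rather than VLM outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.07):
    """Cross-entropy of selecting `positive` among positive + negatives.

    Similarities are cosine similarities scaled by 1 / temperature.
    """
    a = normalize(anchor)
    candidates = [positive] + list(negatives)
    logits = [dot(a, normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def bidirectional_hard_negative_loss(img, txt, neg_imgs, neg_txts,
                                     temperature=0.07):
    """Average of image-to-text and text-to-image losses (assumed weighting).

    neg_txts: hard negative captions for the image.
    neg_imgs: hard negative images for the caption.
    """
    loss_i2t = info_nce(img, txt, neg_txts, temperature)
    loss_t2i = info_nce(txt, img, neg_imgs, temperature)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    img, txt = [1.0, 0.0], [1.0, 0.0]
    easy = bidirectional_hard_negative_loss(img, txt, [[-1.0, 0.0]], [[-1.0, 0.0]])
    hard = bidirectional_hard_negative_loss(img, txt, [[0.9, 0.1]], [[0.9, 0.1]])
    print(f"easy negatives: {easy:.4f}, hard negatives: {hard:.4f}")
```

In this toy setting a negative that lies close to the positive in embedding space yields a much larger loss than one that is far away, which is the sense in which easily discriminated negatives contribute little training signal.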