Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has highlighted severe limitations in the ability of these models to perform compositional reasoning over objects, attributes, and relations. Scene graphs have emerged as an effective way to understand images compositionally. They are graph-structured semantic representations of images that contain the objects in a scene, their attributes, and their relations with other objects. In this work, we treat the scene graph parsed from text as a proxy for the image scene graph and propose a graph decomposition and augmentation framework, together with a coarse-to-fine contrastive learning objective between images and text that aligns sentences of various complexities to the same image. In addition, we propose novel negative mining techniques in the scene graph space to improve attribute binding and relation understanding. Through extensive experiments, we demonstrate that our approach significantly improves attribute binding, relation understanding, systematic generalization, and productivity on multiple recently proposed benchmarks (for example, improvements of up to $18\%$ in systematic generalization and $16.5\%$ in relation understanding over a strong baseline), while achieving similar or better performance than CLIP on various general multimodal tasks.
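To make the idea of working in the scene graph space concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy scene graph consisting of a single relation triplet; the names `Obj`, `Rel`, `decompose`, `swap_attributes`, and `swap_roles` are hypothetical and introduced only for illustration. It shows how one text scene graph could yield positives of increasing complexity for the same image, and how hard negatives could target attribute binding and relation understanding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """A scene graph node: an object name plus its attributes (hypothetical type)."""
    name: str
    attrs: tuple = ()


@dataclass(frozen=True)
class Rel:
    """A scene graph edge: subject, predicate, object (hypothetical type)."""
    subj: Obj
    pred: str
    obj: Obj


def phrase(o: Obj) -> str:
    # Render an object as "attr1 attr2 name".
    return " ".join((*o.attrs, o.name))


def caption(r: Rel) -> str:
    # Render a relation triplet as a short sentence.
    return f"{phrase(r.subj)} {r.pred} {phrase(r.obj)}"


def decompose(r: Rel) -> list:
    # Coarse-to-fine positives: single objects, the bare relation, the full triplet.
    bare = Rel(Obj(r.subj.name), r.pred, Obj(r.obj.name))
    return [phrase(r.subj), phrase(r.obj), caption(bare), caption(r)]


def swap_attributes(r: Rel) -> Rel:
    # Assumed attribute-binding negative: exchange attributes between the two objects.
    return Rel(Obj(r.subj.name, r.obj.attrs), r.pred, Obj(r.obj.name, r.subj.attrs))


def swap_roles(r: Rel) -> Rel:
    # Assumed relation negative: exchange subject and object.
    return Rel(r.obj, r.pred, r.subj)


graph = Rel(Obj("dog", ("brown",)), "chasing", Obj("ball", ("red",)))
positives = decompose(graph)
negatives = [caption(swap_attributes(graph)), caption(swap_roles(graph))]
```

In this sketch every string in `positives` describes the same image at a different level of detail, while each string in `negatives` reuses the same words with a changed binding or role, which is the kind of contrast a compositional objective would need to separate.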