Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, recent studies have shown that even the best VL models struggle to capture aspects of scene understanding such as object attributes, relationships, and action states. At the same time, obtaining structured annotations that could improve these models, e.g., scene graphs (SGs), is time-consuming, costly, and tedious, and thus cannot be done at large scale. Here we ask: can small datasets containing SG annotations provide sufficient information for enhancing the structured understanding of VL models? We show that it is indeed possible to improve VL models using such data by utilizing a specialized model architecture and a new training paradigm. Our approach captures structure-related information for both the visual and textual encoders by directly supervising both components when learning from SG labels. We use scene graph supervision to generate fine-grained captions based on various graph augmentations that highlight different compositional aspects of the scene, and to predict SG information using an open-vocabulary approach by adding special "Adaptive SG tokens" to the visual encoder. Moreover, we design a new adaptation technique tailored specifically to the SG tokens that allows better learning of the graph prediction task while still maintaining zero-shot capabilities. Our model shows strong performance improvements on the Winoground and VL-checklist datasets with only a mild degradation in zero-shot performance.