Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, recent studies have shown that even the best VL models struggle to capture aspects of scene understanding such as object attributes, relationships, and action states. Structured annotations that could improve these models, e.g., scene graphs (SGs), are time-consuming, costly, and tedious to obtain, and thus cannot be collected on a large scale. Here we ask: can small datasets containing SG annotations provide sufficient information for enhancing the structured understanding of VL models? We show that it is indeed possible to improve VL models using such data by utilizing a specialized model architecture and a new training paradigm. Our approach captures structure-related information for both the visual and textual encoders by directly supervising both components when learning from SG labels. We use scene graph supervision to generate fine-grained captions based on various graph augmentations highlighting different compositional aspects of the scene, and to predict SG information using an open-vocabulary approach by adding special "Adaptive SG tokens" to the visual encoder. Moreover, we design a new adaptation technique tailored specifically to the SG tokens that allows better learning of the graph prediction task while still maintaining zero-shot capabilities. Our model shows strong performance improvements on the Winoground and VL-checklist datasets with only a mild degradation in zero-shot performance.
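
To make the idea of generating captions from scene graph augmentations concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the graph format, the caption template, or which augmentations are used, so the `SceneGraph` class, the `graph_to_caption` template, and the two augmentations (`swap_attributes`, `swap_relation_roles`) are all hypothetical choices made for illustration. Each augmentation changes one compositional aspect of the scene (attribute binding or relation direction) while keeping the same words.

```python
from dataclasses import dataclass


@dataclass
class SceneGraph:
    # Hypothetical minimal scene graph, assumed for this sketch only.
    objects: dict     # object id -> (name, tuple of attribute strings)
    relations: tuple  # tuples of (subject id, predicate, object id)


def describe(graph, obj_id):
    """Render one object as 'attr1 attr2 name'."""
    name, attrs = graph.objects[obj_id]
    return " ".join([*attrs, name])


def graph_to_caption(graph):
    """Turn every relation into a clause and join the clauses with 'and'."""
    clauses = [
        f"{describe(graph, subj)} {pred} {describe(graph, obj)}"
        for subj, pred, obj in graph.relations
    ]
    return " and ".join(clauses)


def swap_attributes(graph, a, b):
    """Augmentation (assumed): exchange the attributes of objects a and b."""
    objects = dict(graph.objects)
    (name_a, attrs_a), (name_b, attrs_b) = objects[a], objects[b]
    objects[a] = (name_a, attrs_b)
    objects[b] = (name_b, attrs_a)
    return SceneGraph(objects, graph.relations)


def swap_relation_roles(graph, index):
    """Augmentation (assumed): reverse subject and object of one relation."""
    relations = list(graph.relations)
    subj, pred, obj = relations[index]
    relations[index] = (obj, pred, subj)
    return SceneGraph(graph.objects, tuple(relations))


# Toy example graph: a brown dog sitting on a white sofa.
graph = SceneGraph(
    objects={0: ("dog", ("brown",)), 1: ("sofa", ("white",))},
    relations=((0, "sitting on", 1),),
)

original_caption = graph_to_caption(graph)
attribute_variant = graph_to_caption(swap_attributes(graph, 0, 1))
relation_variant = graph_to_caption(swap_relation_roles(graph, 0))
```

In this toy setup the original caption and its variants contain exactly the same words in a different arrangement, which is the kind of contrast that benchmarks such as Winoground probe. How such captions are actually used during training is not described in the abstract.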
