Recent work in vision-and-language pretraining has investigated using supervised signals from object detection data to learn better fine-grained multimodal representations. In this work, we take a step further and explore how to add supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions and treat them as additional views of images. With masked relation prediction, we further encourage the model to relate entities from visually masked contexts. When applied to strong baselines pretrained on large amounts of Web data, our methods are shown by zero-shot evaluations on both coarse-grained and fine-grained tasks to be effective at learning multimodal representations from weakly-supervised relation data.
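To make the idea of verbalised scene graphs concrete, the following is a minimal sketch of how visual relation triplets could be turned into a structured caption. It is an illustration under stated assumptions, not the exact procedure used in this work: the function name `verbalise_scene_graph`, the `(subject, predicate, object)` tuple layout, the lower-casing, the de-duplication step, and the `[SEP]` separator string are all hypothetical choices.

```python
from typing import List, Tuple

# Assumed layout of one visual relation: (subject, predicate, object).
Triplet = Tuple[str, str, str]


def verbalise_scene_graph(triplets: List[Triplet], sep: str = " [SEP] ") -> str:
    """Join relation triplets into one structured caption.

    Each triplet becomes the phrase "subject predicate object". Phrases are
    lower-cased, exact duplicates are dropped, and the first-seen order is
    kept. The separator is a hypothetical choice, not one taken from the paper.
    """
    seen = set()
    phrases = []
    for subj, pred, obj in triplets:
        phrase = f"{subj} {pred} {obj}".strip().lower()
        if phrase not in seen:
            seen.add(phrase)
            phrases.append(phrase)
    return sep.join(phrases)


if __name__ == "__main__":
    relations = [
        ("man", "riding", "horse"),
        ("horse", "on", "beach"),
        ("man", "riding", "horse"),  # duplicate annotation, dropped
    ]
    print(verbalise_scene_graph(relations))
```

The resulting string can then be paired with its image in the same way as an ordinary caption, which is what allows it to serve as an additional view of the image during pretraining.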