Recent work in vision-and-language pretraining has investigated supervision signals from object detection data to learn better, fine-grained multimodal representations. In this work, we go a step further and explore how to tap into supervision from small-scale visual relation data. In particular, we propose two pretraining approaches to contextualise visual entities in a multimodal setup. With verbalised scene graphs, we transform visual relation triplets into structured captions and treat them as additional image descriptions. With masked relation prediction, we further encourage the model to relate entities from image regions with visually masked contexts. When applied to strong baselines pretrained on large amounts of web data, zero-shot evaluations on both coarse-grained and fine-grained tasks show the efficacy of our methods in learning multimodal representations from weakly supervised relation data.
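As an illustration of the verbalised scene graph idea, the following is a minimal sketch of how relation triplets might be turned into a structured caption. The function name `verbalise_scene_graph`, the `[SEP]` separator, the de-duplication step, and the `max_relations` cap are all assumptions made for this example; the abstract does not specify the exact caption format.

```python
from typing import List, Optional, Tuple

# A visual relation triplet: (subject, predicate, object).
Triplet = Tuple[str, str, str]


def verbalise_scene_graph(
    triplets: List[Triplet],
    sep: str = " [SEP] ",  # hypothetical separator between relations
    max_relations: Optional[int] = None,  # hypothetical cap on caption length
) -> str:
    """Join relation triplets into one structured caption string."""
    seen = set()
    phrases = []
    for subj, pred, obj in triplets:
        # Normalise whitespace and case so duplicates are detected reliably.
        phrase = " ".join(f"{subj} {pred} {obj}".lower().split())
        if phrase in seen:
            continue  # skip repeated relations, keeping first-seen order
        seen.add(phrase)
        phrases.append(phrase)
    if max_relations is not None:
        phrases = phrases[:max_relations]
    return sep.join(phrases)


if __name__ == "__main__":
    graph = [
        ("man", "riding", "horse"),
        ("horse", "on", "beach"),
        ("man", "riding", "horse"),  # duplicate annotation
    ]
    print(verbalise_scene_graph(graph))
```

The resulting string could then be paired with its image in the same way as an ordinary caption during pretraining, which is the role the abstract describes for these structured captions.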