Visual commonsense reasoning (VCR) is a challenging multi-modal task that requires high-level cognition and commonsense reasoning about the real world. In recent years, large-scale pre-training approaches have advanced the state of the art on VCR. However, almost all existing approaches employ BERT-like objectives to learn multi-modal representations. These objectives, which originate in the text domain, are insufficient for exploiting the complex scenarios of the visual modality. Most importantly, the spatial distribution of visual objects is largely neglected. To address this issue, we propose to construct a spatial relation graph based on the given visual scenario. Further, we design two pre-training tasks, object position regression (OPR) and spatial relation classification (SRC), that learn to reconstruct the spatial relation graph. Quantitative analysis suggests that the proposed method guides the representations to retain more spatial context and directs attention to the visual regions essential for reasoning. We achieve state-of-the-art results on VCR and on two other vision-and-language reasoning tasks, VQA and NLVR.
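To make the idea of a spatial relation graph more concrete, the following is a minimal sketch of how such a graph might be built from object bounding boxes. The abstract does not specify the construction, so everything here is an assumption for illustration: the relation label set (`left`, `right`, `above`, `below`, `overlap`), the normalized node features, and the function names `spatial_relation` and `build_spatial_graph` are hypothetical and are not taken from the paper. Under these assumptions, the node features could serve as regression targets for an OPR-style task and the edge labels as classification targets for an SRC-style task.

```python
from typing import Dict, List, Tuple

# Box format (assumed): (x1, y1, x2, y2) in pixels, with y growing downward.
Box = Tuple[float, float, float, float]


def _intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (0.0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def _center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def spatial_relation(a: Box, b: Box) -> str:
    """Hypothetical label describing where box b lies relative to box a."""
    if _intersection_area(a, b) > 0:
        return "overlap"
    ax, ay = _center(a)
    bx, by = _center(b)
    dx, dy = bx - ax, by - ay
    # Use the dominant axis of the center offset to pick a direction.
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "below" if dy > 0 else "above"


def build_spatial_graph(
    boxes: List[Box], image_width: float, image_height: float
) -> Tuple[List[Tuple[float, float, float, float]], Dict[Tuple[int, int], str]]:
    """Return (nodes, edges) for an illustrative spatial relation graph.

    nodes[i] is (cx, cy, w, h) normalized by the image size, a possible
    regression target. edges[(i, j)] is the relation of box j relative to
    box i, a possible classification target.
    """
    nodes = []
    for box in boxes:
        cx, cy = _center(box)
        nodes.append((
            cx / image_width,
            cy / image_height,
            (box[2] - box[0]) / image_width,
            (box[3] - box[1]) / image_height,
        ))
    edges = {
        (i, j): spatial_relation(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(len(boxes))
        if i != j
    }
    return nodes, edges


if __name__ == "__main__":
    example_boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (5, 5, 15, 15)]
    nodes, edges = build_spatial_graph(example_boxes, image_width=40, image_height=20)
    print(nodes[0])
    print(edges[(0, 1)], edges[(0, 2)])
```

In a pre-training setup along these lines, a model could be asked to predict masked node features and edge labels from its multi-modal representations; the actual targets and losses used by the authors may differ from this sketch.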