A framework for Visual Commonsense Reasoning (VCR) must choose an answer and then provide a rationale justifying it, based on a given image and question. The image contains all the facts needed for reasoning and must therefore be understood thoroughly. Previous methods apply a detector to the image to obtain a set of visual objects but do not consider the objects' exact positions in the scene, which is inadequate for properly understanding the spatial and semantic relationships between objects. In addition, VCR samples are quite diverse, so framework parameters trained on mini-batches tend to be suboptimal. To address these challenges, this paper proposes a pseudo-3D perception Transformer with multi-level confidence optimization, named PPTMCO, for VCR. Specifically, image depth is introduced so that, together with the 2-dimensional (2D) image coordinates, it represents the pseudo 3-dimensional (3D) positions of objects, which are further used to enhance visual features. Then, because relationships between objects are influenced by depth, a depth-aware Transformer is proposed whose attention is guided by depth differences, both between answer words and objects and between objects themselves, where each word is tagged with a pseudo depth value according to its related objects. To better optimize the framework's parameters, a model parameter estimation method is further proposed that integrates the parameters optimized on individual mini-batches, weighted by multi-level reasoning confidence. Experiments on the benchmark VCR dataset demonstrate that the proposed framework performs better than state-of-the-art approaches.
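The abstract does not state how depth differences enter the attention computation. As a minimal sketch of one plausible reading, the code below subtracts a penalty proportional to the absolute depth difference from the scaled dot-product logits, so that tokens at similar pseudo depth attend to each other more strongly. The function name `depth_biased_attention`, the `penalty` parameter, and the additive form of the bias are all assumptions made for illustration, not the formulation used in PPTMCO.

```python
import math


def depth_biased_attention(queries, keys, values, q_depth, k_depth, penalty=1.0):
    """Scaled dot-product attention with an additive depth-difference bias.

    ASSUMPTION (not taken from the paper): each logit is reduced by
    penalty * |depth_query - depth_key|.

    queries, keys, values: lists of equal-length float vectors
    q_depth, k_depth: one pseudo depth value per query / key
    Returns (outputs, weights), where weights[i][j] is the attention
    that query i pays to key j.
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs, all_weights = [], []
    for q, zq in zip(queries, q_depth):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            - penalty * abs(zq - zk)
            for k, zk in zip(keys, k_depth)
        ]
        # Numerically stable softmax over the keys.
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(out_dim)]
        )
    return outputs, all_weights


if __name__ == "__main__":
    # One answer word (tagged with depth 1.0) attending to two objects.
    word = [[0.0, 0.0]]
    objects = [[0.0, 0.0], [0.0, 0.0]]
    features = [[1.0, 0.0], [0.0, 1.0]]
    _, w = depth_biased_attention(word, objects, features, [1.0], [1.0, 3.0])
    print(w[0])  # the object at the same depth receives the larger weight
```

Setting `penalty=0.0` recovers ordinary scaled dot-product attention, which makes it easy to compare behavior with and without the depth term.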
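The confidence-weighted integration of mini-batch parameters can be illustrated with a small sketch. The abstract does not specify how the multi-level reasoning confidence is computed or combined, so the code below assumes each parameter snapshot already comes with a single non-negative confidence score and forms a normalized weighted average. The function name `integrate_parameters` and the snapshot layout (a dict from parameter name to a list of floats) are hypothetical.

```python
def integrate_parameters(snapshots, confidences):
    """Confidence-weighted average of parameter snapshots.

    ASSUMPTION (not taken from the paper): one scalar confidence per
    snapshot, normalized to sum to 1 and used as the averaging weight.

    snapshots: list of dicts mapping parameter name -> list of floats,
               e.g. the parameters obtained after different mini-batches
    confidences: one non-negative score per snapshot
    """
    if not snapshots or len(snapshots) != len(confidences):
        raise ValueError("need one confidence score per snapshot")
    if any(c < 0 for c in confidences):
        raise ValueError("confidence scores must be non-negative")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one confidence score must be positive")
    weights = [c / total for c in confidences]

    merged = {}
    for name, reference in snapshots[0].items():
        merged[name] = [
            sum(w * snap[name][i] for w, snap in zip(weights, snapshots))
            for i in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    after_batch_1 = {"w": [0.0, 2.0]}
    after_batch_2 = {"w": [4.0, 2.0]}
    # The second snapshot is trusted three times as much as the first.
    print(integrate_parameters([after_batch_1, after_batch_2], [1.0, 3.0]))
```

With equal confidence scores this reduces to plain parameter averaging; unequal scores shift the result toward the snapshots the model was more confident about.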