Despite rapid progress in visual question answering (VQA), existing datasets and models mainly focus on testing reasoning in 2D. However, it is important that VQA models also understand the 3D structure of visual scenes, for example to support tasks like navigation or manipulation. This includes an understanding of 3D object poses, object parts, and occlusions. In this work, we introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes. We address 3D-aware VQA from both the dataset and the model perspective. First, we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains questions about object parts, their 3D poses, and occlusions. Second, we propose PO3D-VQA, a 3D-aware VQA model that combines two powerful ideas: probabilistic neural symbolic program execution for reasoning, and deep neural networks with 3D generative representations of objects for robust visual recognition. Our experimental results show that PO3D-VQA significantly outperforms existing methods, but a substantial performance gap remains relative to 2D VQA benchmarks, indicating that 3D-aware VQA remains an important open research area.