Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habitat simulator. In total, it consists of approximately 5k scenes and 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, ground that representation in open-vocabulary semantic concepts, and then execute reasoning on the resulting 3D representation. As the first step towards this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators. Experimental results suggest that our framework outperforms baseline models by a large margin, but the challenge remains largely unsolved. We further perform an in-depth analysis of the challenges and highlight potential future directions.