We present a method to map 2D image observations of a scene to a persistent 3D scene representation, enabling novel view synthesis and disentangled representation of the movable and immovable components of the scene. Motivated by the bird's-eye-view (BEV) representation commonly used in vision and robotics, we propose conditional neural groundplans, ground-aligned 2D feature grids, as persistent and memory-efficient scene representations. Our method is trained in a self-supervised manner from unlabeled multi-view observations using differentiable rendering, and learns to complete the geometry and appearance of occluded regions. In addition, we show that we can leverage multi-view videos at training time to learn to separately reconstruct static and movable components of the scene from a single image at test time. The ability to separately reconstruct movable objects enables a variety of downstream tasks using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance-level segmentation, 3D bounding box prediction, and scene editing. This highlights the value of neural groundplans as a backbone for efficient 3D scene understanding models.