BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image

Understanding and modeling a 3D scene from a single image is a practical problem. A recent advance proposes the task of panoptic 3D scene reconstruction, which performs both 3D reconstruction and 3D panoptic segmentation from a single image. Despite substantial progress, recent works focus only on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, and their performance is hindered by two ambiguities. (1) Instance-channel ambiguity: because instance ids vary from scene to scene, filling voxel channels with 2D information is ambiguous, which confuses the subsequent 3D refinement. (2) Voxel-reconstruction ambiguity: 2D-to-3D lifting with estimated single-view depth propagates 2D information only onto the surface of 3D regions, leaving the reconstruction of regions behind the frontal-view surface ambiguous. In this paper, we propose BUOL, a Bottom-Up framework with Occupancy-aware Lifting, to address these two issues in panoptic 3D scene reconstruction from a single image. To address instance-channel ambiguity, the bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments rather than arbitrary instance id assignments. The 3D voxels are then refined and grouped into 3D instances according to the predicted 2D instance centers. To address voxel-reconstruction ambiguity, the estimated multi-plane occupancy is leveraged together with depth to fill the whole regions of things and stuff. Our method substantially outperforms state-of-the-art methods on the synthetic dataset 3D-Front and the real-world dataset Matterport3D. Code and models are available at https://github.com/chtsy/buol.
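
The two ideas summarized above can be illustrated with a minimal NumPy sketch. It is not the released BUOL implementation: the function names `lift_semantics` and `group_by_centers`, the threshold `occ_thresh`, and the array layouts are hypothetical, camera projection is ignored (each pixel maps to one column of voxels along depth), and instance centers are assumed to be already expressed in voxel coordinates. The sketch only shows that voxel channels are indexed by semantic class rather than by instance id, and that multi-plane occupancy fills voxels behind the visible surface.

```python
import numpy as np


def lift_semantics(sem_prob, depth, occupancy, plane_depths, occ_thresh=0.5):
    """Lift per-pixel semantic probabilities into a voxel volume.

    sem_prob:     (C, H, W) semantic class probabilities (channel = class).
    depth:        (H, W) estimated depth per pixel.
    occupancy:    (D, H, W) estimated occupancy on D depth planes.
    plane_depths: (D,) depth value of each plane.
    Returns a (C, D, H, W) volume (illustrative layout, an assumption).
    """
    num_planes = plane_depths.shape[0]
    # Surface voxels: for each pixel, the plane closest to the estimated depth.
    surface_idx = np.abs(plane_depths[:, None, None] - depth[None]).argmin(axis=0)
    surface = np.arange(num_planes)[:, None, None] == surface_idx[None]
    # Occupancy-aware fill: also keep planes predicted as occupied,
    # so regions behind the frontal surface receive 2D information too.
    filled = surface | (occupancy > occ_thresh)
    # Broadcast each pixel's semantics to every filled voxel on its column.
    return sem_prob[:, None] * filled[None]


def group_by_centers(voxel_coords, centers):
    """Assign each voxel (N, 3) to the index of its nearest center (K, 3)."""
    dists = np.linalg.norm(voxel_coords[:, None] - centers[None], axis=-1)
    return dists.argmin(axis=1)
```

Dropping the `occupancy > occ_thresh` term reduces `lift_semantics` to depth-only lifting, which writes information solely onto the surface plane; that is the voxel-reconstruction ambiguity the abstract describes.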
