BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image

Understanding and modeling a 3D scene from a single image is a practical problem. A recent advance proposes the panoptic 3D scene reconstruction task, which performs both 3D reconstruction and 3D panoptic segmentation from a single image. Despite substantial progress, recent works focus only on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, and their performance is hindered by two ambiguities. (1) Instance-channel ambiguity: the ids of instances vary from scene to scene, so filling voxel channels with 2D information is ambiguous, which confuses the subsequent 3D refinement. (2) Voxel-reconstruction ambiguity: 2D-to-3D lifting with estimated single-view depth propagates 2D information only onto the surface of 3D regions, which leaves the reconstruction of regions behind the frontal-view surface ambiguous. In this paper, we propose BUOL, a Bottom-Up framework with Occupancy-aware Lifting, to address these two issues for panoptic 3D scene reconstruction from a single image. For instance-channel ambiguity, the bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments rather than arbitrary instance id assignments. The 3D voxels are then refined and grouped into 3D instances according to the predicted 2D instance centers. For voxel-reconstruction ambiguity, the estimated multi-plane occupancy is leveraged together with depth to fill the whole regions of things and stuff. Our method shows a substantial performance advantage over state-of-the-art methods on the synthetic dataset 3D-Front and the real-world dataset Matterport3D. Code and models are available at https://github.com/chtsy/buol.
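
To make the two ideas concrete, the following toy sketch may help. It is not the released BUOL implementation: it assumes an orthographic camera, integer depth planes, and hard labels instead of network features, and every name in it (`lift_semantics`, `group_instances`, the toy arrays) is hypothetical. It shows that lifting semantic labels with depth alone fills only the surface voxel of each ray, that adding a per-plane occupancy mask also fills voxels behind the surface, and that the filled "thing" voxels can then be grouped by their nearest predicted 2D instance center.

```python
EMPTY = 0  # label 0 means "no content" in this toy example


def lift_semantics(semantic, depth, num_planes, occupancy=None):
    """Lift per-pixel semantic labels into an H x W x D grid of labels.

    semantic[v][u]     : semantic label of pixel (v, u), EMPTY for none
    depth[v][u]        : index of the depth plane hit by the ray (the surface)
    occupancy[v][u][d] : optional 0/1 mask, 1 if plane d is occupied
    """
    height, width = len(semantic), len(semantic[0])
    voxels = [[[EMPTY] * num_planes for _ in range(width)] for _ in range(height)]
    for v in range(height):
        for u in range(width):
            label = semantic[v][u]
            if label == EMPTY:
                continue
            surface = depth[v][u]
            voxels[v][u][surface] = label  # depth-only lifting: surface voxel
            if occupancy is not None:
                # occupancy-aware lifting: also fill occupied planes behind it
                for d in range(surface + 1, num_planes):
                    if occupancy[v][u][d]:
                        voxels[v][u][d] = label
    return voxels


def group_instances(voxels, centers, thing_labels):
    """Assign each 'thing' voxel to the nearest 2D center of the same class.

    centers: list of (v, u, label) tuples; returns {(v, u, d): center index}.
    """
    instances = {}
    for v, row in enumerate(voxels):
        for u, ray in enumerate(row):
            for d, label in enumerate(ray):
                if label not in thing_labels:
                    continue
                candidates = [(i, c) for i, c in enumerate(centers) if c[2] == label]
                if not candidates:
                    continue
                nearest = min(
                    candidates,
                    key=lambda ic: (ic[1][0] - v) ** 2 + (ic[1][1] - u) ** 2,
                )
                instances[(v, u, d)] = nearest[0]
    return instances


# Toy scene: a 1 x 4 image with 3 depth planes; label 1 is a "thing" class.
semantic = [[1, 1, 0, 1]]
depth = [[0, 0, 0, 1]]
occupancy = [[[1, 1, 0], [1, 1, 0], [0, 0, 0], [0, 1, 1]]]
centers = [(0, 0, 1), (0, 3, 1)]  # two predicted 2D centers of class 1

surface_only = lift_semantics(semantic, depth, num_planes=3)
filled = lift_semantics(semantic, depth, num_planes=3, occupancy=occupancy)
instances = group_instances(filled, centers, thing_labels={1})
```

Because the voxel grid stores semantic labels, its channel meaning is the same for every scene; instance identity only appears in the final grouping step, which is the point of the bottom-up design described above.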
