Recent 3D feedforward reconstruction neural networks have achieved remarkable success in dense reconstruction from images without any camera parameters. Yet equipping these models with robust semantic understanding remains an open problem. Here we introduce an approach that performs 3D reconstruction and 3D panoptic segmentation in a unified framework. We build on existing 3D reconstruction models and augment them with a set-based mask decoder. The model is trained jointly with a geometric loss and a semantic loss, which are shown to be mutually beneficial. More precisely, the features are initialized from the geometric information and then finetuned to jointly capture geometry and semantics. We demonstrate the generality of our approach by applying the framework to both online and all-to-all attention reconstruction backbones. Our method achieves state-of-the-art performance in 3D panoptic segmentation on the ScanNet, ScanNet200, and ScanNet++ datasets. Ablation studies show that joint training of a unified model equips 3D feedforward reconstruction neural networks with panoptic segmentation and yields mutually beneficial improvements in geometry and semantics.