SSR-2D: Semantic 3D Scene Reconstruction from 2D Images

Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely semantic scene reconstruction, without using any 3D annotations. The key idea of our approach is to design a trainable model that takes both incomplete 3D reconstructions and their corresponding source RGB-D images as input, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics using only 2D labels, which can be either manual or machine-generated. Our key technical innovation is to leverage differentiable rendering of color and semantics to bridge 2D observations and unknown 3D space, with the observed RGB images supervising the rendered color and the 2D semantic labels supervising the rendered semantics. We additionally develop a learning pipeline and corresponding method that enable learning from imperfect predicted 2D labels. Such labels can also be acquired for an augmented set of synthesized virtual training views that complement the original real captures, enabling a more efficient self-supervision loop for semantics. The result is an end-to-end trainable solution that jointly addresses geometry completion, colorization, and semantic mapping from limited RGB-D images, without relying on any 3D ground-truth information. Our method achieves state-of-the-art semantic scene reconstruction performance on two large-scale benchmark datasets, MatterPort3D and ScanNet, surpassing even baselines trained with costly 3D annotations. To our knowledge, our method is also the first 2D-driven method addressing completion and semantic segmentation of real-world 3D scans.
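The idea of supervising unknown 3D space through rendered 2D color and semantics can be illustrated with a small sketch. The abstract does not specify the form of the renderer, so the code below substitutes generic density-based alpha compositing along a single ray; it is not the authors' implementation. The function names `composite_ray` and `rendering_losses`, the L1 color loss, and the cross-entropy semantic loss are all assumptions made for illustration.

```python
import numpy as np

def composite_ray(density, color, sem_logits, step):
    """Alpha-composite per-sample color and semantic logits along one ray.

    Hypothetical helper: density (S,), color (S, 3), sem_logits (S, C),
    where S samples are spaced `step` apart along the ray.
    """
    alpha = 1.0 - np.exp(-density * step)  # opacity of each sample
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha                # contribution of each sample
    rgb = weights @ color                  # rendered pixel color, shape (3,)
    logits = weights @ sem_logits          # rendered class scores, shape (C,)
    return rgb, logits, weights

def rendering_losses(rgb, logits, target_rgb, target_label):
    """2D supervision only: L1 on color, cross-entropy on semantics (assumed)."""
    color_loss = np.abs(rgb - target_rgb).mean()
    shifted = logits - logits.max()        # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    sem_loss = -log_probs[target_label]
    return color_loss, sem_loss

# Toy ray: two empty samples, then an opaque red surface labeled as class 1.
density = np.array([0.0, 0.0, 50.0, 50.0])
color = np.array([[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
sem_logits = np.array([[0, 0], [0, 0], [0, 5], [5, 0]], dtype=float)

rgb, logits, weights = composite_ray(density, color, sem_logits, step=0.1)
color_loss, sem_loss = rendering_losses(rgb, logits, np.array([1.0, 0.0, 0.0]), 1)
```

Every operation here is a smooth function of the per-sample values, so in an autodiff framework the 2D losses would propagate gradients back to the volumetric predictions. In this toy ray the first opaque sample dominates the rendered pixel, which is how a 2D label ends up constraining a specific region of 3D space.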
