Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely semantic scene reconstruction, using a fully self-supervised approach. To this end, we design a trainable model that takes both incomplete 3D reconstructions and their corresponding source RGB-D images as input, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics. Our key technical innovation is to leverage differentiable rendering of color and semantics, using the observed RGB images as color supervision and the predictions of a generic semantic segmentation model as semantic supervision. We additionally develop a method to synthesize an augmented set of virtual training views that complements the original real captures, enabling more efficient self-supervision for semantics. The result is an end-to-end trainable solution that jointly addresses geometry completion, colorization, and semantic mapping from a few RGB-D images, without 3D or 2D ground truth. To our knowledge, ours is the first fully self-supervised method to address completion and semantic segmentation of real-world 3D scans. It performs comparably to 3D-supervised baselines, surpasses baselines with 2D supervision on real datasets, and generalizes well to unseen scenes.
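To make the rendering-based supervision concrete, the following is a minimal sketch of how color and semantics could be composited along a single camera ray and compared against an observed RGB pixel and a 2D pseudo-label. It assumes standard emission-absorption volume compositing, an L1 color loss, and a cross-entropy semantic loss; the abstract does not specify these choices, and the function names `composite_along_ray` and `render_losses` are hypothetical. NumPy is used only to show the arithmetic; in practice the same operations would be written in an automatic-differentiation framework so that gradients reach the volumetric predictions.

```python
import numpy as np


def composite_along_ray(density, color, sem_logits, deltas):
    """Composite per-sample predictions along one ray (assumed formulation).

    density:    (S,)   non-negative volume density at each sample
    color:      (S, 3) predicted RGB at each sample
    sem_logits: (S, C) predicted class logits at each sample
    deltas:     (S,)   distance between consecutive samples
    """
    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-density * deltas)
    # Transmittance: fraction of light that reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans
    # Rendered pixel color and rendered semantic logits.
    rgb = weights @ color
    logits = weights @ sem_logits
    return rgb, logits, weights


def render_losses(rgb, logits, target_rgb, pseudo_label):
    """L1 color loss against the observed pixel, and cross-entropy against
    the label predicted by an off-the-shelf 2D segmentation model."""
    color_loss = np.abs(rgb - target_rgb).mean()
    shifted = logits - logits.max()  # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    sem_loss = -log_probs[pseudo_label]
    return color_loss, sem_loss
```

Because both losses are computed in image space, neither requires 3D annotations: the color target comes from the captured image and the semantic target from the 2D model's prediction for the same pixel.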