We present a novel, real-time-capable learning method that jointly perceives a 3D scene's geometric structure and semantic labels. Recent approaches to real-time 3D scene reconstruction mostly adopt a volumetric scheme in which a truncated signed distance function (TSDF) is regressed directly. However, these volumetric approaches tend to focus on the global coherence of their reconstructions, which leads to a lack of local geometric detail. To overcome this issue, we propose to leverage the latent geometric prior knowledge in 2D image features, through explicit depth prediction and anchored feature generation, to refine occupancy learning in the TSDF volume. We also find that this cross-dimensional feature refinement methodology can be adopted for the semantic segmentation task. Hence, we propose an end-to-end cross-dimensional refinement neural network (CDRNet) that extracts both 3D meshes and 3D semantic labels in real time. Experimental results show that the proposed method achieves state-of-the-art 3D perception efficiency on multiple datasets, which indicates its strong potential for industrial applications.
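To make the TSDF terminology above concrete, the following is a minimal sketch of how a depth value along a single camera ray relates to truncated signed distances and to a near-surface occupancy mask. It is an illustration only and is not CDRNet's implementation: the function names `truncated_sdf` and `near_surface_mask`, the truncation distance, and the sample depths are all assumptions chosen for this example.

```python
import numpy as np


def truncated_sdf(voxel_depths, surface_depth, truncation):
    """Projective signed distance along one camera ray, scaled to [-1, 1].

    Hypothetical helper for illustration. Values are positive in front of
    the surface, negative behind it, and clipped at the truncation distance.
    """
    sdf = surface_depth - np.asarray(voxel_depths, dtype=float)
    return np.clip(sdf / truncation, -1.0, 1.0)


def near_surface_mask(tsdf):
    """Mark voxels whose TSDF is not saturated, i.e. within the truncation band."""
    return np.abs(tsdf) < 1.0


# Assumed sample data: voxel centers at these depths (meters) along a ray,
# with a surface observed (or predicted) at a depth of 1.0 m.
voxel_depths = np.array([0.5, 0.9, 1.0, 1.1, 1.5])
tsdf = truncated_sdf(voxel_depths, surface_depth=1.0, truncation=0.2)
mask = near_surface_mask(tsdf)
```

In this simplified view, a more accurate depth estimate shifts the zero crossing of the TSDF and changes which voxels fall inside the truncation band. This is one way to picture how 2D depth cues could sharpen local geometry in a volumetric representation.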