This paper presents an efficient online framework for solving the well-known semantic Visual Simultaneous Localization and Mapping (V-SLAM) problem in indoor scenes, leveraging the advantages of neural implicit scene representation. Existing methods along similar lines, such as NICE-SLAM, have critical practical limitations that hinder their use for this important indoor scene understanding problem. To this end, we argue, contrary to existing methods, for the following propositions for modern semantic V-SLAM with RGB-D frames as input: (i) For a rigid scene, robust and accurate camera motion can be computed with a disentangled tracking and 3D mapping pipeline. (ii) Using neural fields, a dense and multifaceted scene representation comprising SDF, semantics, RGB, and depth can be provided in a memory-efficient manner. (iii) Rather than using every frame, we demonstrate that a set of keyframes is sufficient to learn an excellent scene representation, thereby reducing the pipeline's training time. (iv) Multiple local mapping networks can be used to extend the pipeline to large-scale scenes. We show via extensive experiments on several popular benchmark datasets that our approach offers accurate tracking, mapping, and semantic labeling at test time, even with noisy and highly sparse depth measurements. Later in the paper, we show that our pipeline can easily be extended to RGB image input. Overall, the proposed pipeline offers a favorable solution to an important scene understanding task that can assist in diverse robot visual perception and related problems.