We propose SNI-SLAM, a semantic SLAM system built on a neural implicit representation, that simultaneously performs accurate semantic mapping, high-quality surface reconstruction, and robust camera tracking. In this system, we introduce a hierarchical semantic representation that enables multi-level semantic comprehension for top-down structured semantic mapping of the scene. In addition, to fully exploit the correlation between multiple attributes of the environment, we integrate appearance, geometry, and semantic features through cross-attention for feature collaboration. This strategy yields a more multifaceted understanding of the environment, allowing SNI-SLAM to remain robust even when a single attribute is defective. We then design an internal fusion-based decoder that obtains semantic, RGB, and Truncated Signed Distance Field (TSDF) values from multi-level features for accurate decoding. Furthermore, we propose a feature loss to update the scene representation at the feature level. Compared with low-level losses such as RGB loss and depth loss, our feature loss can guide network optimization at a higher level. SNI-SLAM demonstrates superior performance over all recent NeRF-based SLAM methods in terms of mapping and tracking accuracy on the Replica and ScanNet datasets, while also showing excellent capabilities in accurate semantic segmentation and real-time semantic mapping.
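To make the cross-attention fusion and the feature-level loss more concrete, the following is a minimal sketch and not the SNI-SLAM implementation. It assumes single-head scaled dot-product attention and an L1 distance between feature sets. The function names `cross_attention` and `feature_l1_loss`, and the choice of which attribute supplies the queries, keys, and values, are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: N vectors of dimension d (e.g. features of one attribute)
    keys:    M vectors of dimension d (e.g. features of a second attribute)
    values:  M vectors of any dimension (e.g. features of a third attribute)
    Returns N fused vectors, each a weighted average of the value vectors.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return fused


def feature_l1_loss(predicted, target):
    """Mean absolute difference between two equally shaped feature sets.

    Assumption: an L1 distance stands in for the feature loss here; the
    abstract does not specify the exact form.
    """
    count = sum(len(row) for row in predicted)
    total = sum(abs(p - t) for pr, tr in zip(predicted, target) for p, t in zip(pr, tr))
    return total / count


if __name__ == "__main__":
    geometry = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical query features
    semantic = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical key features
    appearance = [[0.2, 0.4], [0.6, 0.8]]  # hypothetical value features
    fused = cross_attention(geometry, semantic, appearance)
    print(fused)
    print(feature_l1_loss(fused, appearance))
```

Because each fused vector is a convex combination of the value vectors, a query can still draw on the other attributes when one of them is uninformative, which is the intuition behind the robustness claim. A loss of this kind compares features directly rather than rendered colors or depths.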