We propose SNI-SLAM, a semantic SLAM system built on a neural implicit representation that simultaneously performs accurate semantic mapping, high-quality surface reconstruction, and robust camera tracking. In this system, we introduce a hierarchical semantic representation that enables multi-level semantic comprehension for top-down, structured semantic mapping of the scene. In addition, to fully exploit the correlation between multiple attributes of the environment, we integrate appearance, geometry, and semantic features through cross-attention for feature collaboration. This strategy yields a more multifaceted understanding of the environment, allowing SNI-SLAM to remain robust even when a single attribute is defective. We then design an internal fusion-based decoder that obtains semantic, RGB, and truncated signed distance field (TSDF) values from multi-level features for accurate decoding. Furthermore, we propose a feature loss that updates the scene representation at the feature level. Compared with low-level losses such as RGB loss and depth loss, our feature loss can guide network optimization at a higher level. SNI-SLAM demonstrates superior mapping and tracking accuracy over all recent NeRF-based SLAM methods on the Replica and ScanNet datasets, while also showing excellent capabilities in accurate semantic segmentation and real-time semantic mapping.
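To make the cross-attention step concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which features of one attribute (here semantics) query features of another (here geometry). It is not the SNI-SLAM implementation: the abstract does not specify the architecture, so the function names, feature dimensions, and random projection matrices below are all illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    query_feats:   N x d_in features of one attribute, e.g. semantics
    context_feats: M x d_in features of another attribute, e.g. geometry
    w_q, w_k, w_v: d_in x d projection matrices (learned in a real system)
    Returns the N x d fused features and the N x M attention weights.
    """
    q = matmul(query_feats, w_q)
    k = matmul(context_feats, w_k)
    v = matmul(context_feats, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


# Toy inputs: random stand-ins for learned features and projections.
rng = random.Random(0)


def random_matrix(rows, cols, scale=1.0):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


semantic = random_matrix(4, 8)   # 4 query points, 8-dim features (assumed sizes)
geometry = random_matrix(5, 8)   # 5 context points, 8-dim features (assumed sizes)
w_q, w_k, w_v = (random_matrix(8, 6, scale=0.1) for _ in range(3))

fused, weights = cross_attention(semantic, geometry, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # prints: 4 6
```

Each fused feature is a convex combination of the projected context features, so a query whose own attribute is noisy still receives information from the other attribute. This is one plausible reading of how attribute fusion could preserve robustness when a single attribute is defective.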