Fusion-based place recognition is an emerging technique that jointly uses multi-modal perception data to recognize previously visited places in GPS-denied scenarios for robots and autonomous vehicles. Recent fusion-based place recognition methods combine multi-modal features implicitly. While achieving remarkable results, they do not explicitly consider what each individual modality contributes to the fusion system, so the benefit of multi-modal feature fusion may not be fully exploited. In this paper, we propose a novel fusion-based network, dubbed EINet, to achieve explicit interaction between the two modalities, LiDAR and camera. EINet uses LiDAR ranges to supervise vision features that are more robust over long time spans, and simultaneously uses camera RGB data to improve the discriminability of LiDAR point clouds. In addition, we develop a new benchmark for the place recognition task based on the nuScenes dataset. To establish this benchmark for future research with comprehensive comparisons, we introduce both supervised and self-supervised training schemes alongside evaluation protocols. We conduct extensive experiments on the proposed benchmark, and the results show that EINet achieves better recognition performance and solid generalization ability compared with state-of-the-art fusion-based place recognition approaches. Our open-source code and benchmark are released at: https://github.com/BIT-XJY/EINet.