Place recognition is a challenging task in computer vision, crucial for enabling autonomous vehicles and robots to navigate previously visited environments. While significant progress has been made in learnable multimodal methods that combine onboard camera images and LiDAR point clouds, the full potential of these methods remains largely unexplored in localization applications. In this paper, we study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition, incorporating explicit visual semantics and text descriptions. Our proposed method named MSSPlace utilizes images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors. We employ a late fusion approach to integrate these modalities, providing a unified representation. Through extensive experiments on the Oxford RobotCar and NCLT datasets, we systematically analyze the impact of each data source on the overall quality of place descriptors. Our experiments demonstrate that combining data from multiple sensors significantly improves place recognition model performance compared to single modality approaches and leads to state-of-the-art quality. We also show that separate usage of visual or textual semantics (which are more compact representations of sensory data) can achieve promising results in place recognition. The code for our method is publicly available: https://github.com/alexmelekhin/MSSPlace
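The late-fusion strategy mentioned above can be sketched as follows. This is a minimal illustration only, not the paper's exact architecture: the modality names, embedding dimensions, and the choice of L2-normalized concatenation are assumptions made for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (no-op for the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def late_fusion(descriptors):
    """Fuse per-modality descriptors into one place descriptor.

    Each modality's embedding is normalized independently, the results
    are concatenated, and the joint vector is renormalized so that
    distances between fused descriptors remain comparable.
    """
    fused = [x for d in descriptors for x in l2_normalize(d)]
    return l2_normalize(fused)

# Hypothetical per-modality embeddings (dimensions are illustrative).
img = [0.2, 0.9, 0.1]   # multi-camera image branch
lidar = [0.5, 0.5]      # LiDAR point-cloud branch
sem = [1.0, 0.0, 0.0]   # semantic segmentation branch
txt = [0.3, 0.7]        # text annotation branch

place_descriptor = late_fusion([img, lidar, sem, txt])
```

The fused descriptor's length is simply the sum of the per-modality embedding sizes, and nearest-neighbor search over such vectors is how retrieval-based place recognition is typically evaluated.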