To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. They also promise to reduce computation requirements through a unified model and to achieve greater sample efficiency through parameter sharing. In this work, we develop a universal solution to place recognition, UniLoc, that works with any single query modality (natural language, image, or point cloud). UniLoc leverages recent advances in large-scale contrastive learning and learns by matching hierarchically at two levels: instance-level matching and scene-level matching. Specifically, we propose a novel Self-Attention based Pooling (SAP) module to evaluate the importance of instance descriptors when aggregating them into a place-level descriptor. Experiments on the KITTI-360 dataset demonstrate the benefits of cross-modality for place recognition, achieving superior performance in cross-modal settings and competitive results in uni-modal scenarios. Our project page is publicly available at https://yan-xia.github.io/projects/UniLoc/.