Over the past decade, most visual place recognition (VPR) methods have used neural networks to produce feature representations. These networks typically produce a global representation of a place image from that image alone, neglecting cross-image variations (e.g. in viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the self-attention mechanism to correlate multiple images within a batch. These images can be taken at the same place under different conditions or from different viewpoints, or even captured at different places. Our method can therefore use cross-image variations as a cue to guide representation learning, ensuring that more robust features are produced. To further improve robustness, we propose a multi-scale convolution-enhanced adaptation method that adapts pre-trained visual foundation models to the VPR task, introducing multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time, achieving 94.5% R@1 on Pitts30k with 512-dim global features. The code is released at https://github.com/Lu-Feng/CricaVPR.
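The core idea of correlating images within a batch via self-attention can be sketched as follows. This is a minimal illustration, not the authors' implementation: it applies scaled dot-product attention across the global descriptors of a batch of images, so each image's representation is refined using the others as context. The projection matrices are random stand-ins for learned weights, and the function name and shapes are assumptions for illustration only.

```python
# Hedged sketch (not the CricaVPR code): cross-image correlation via
# self-attention over the global descriptors of images in one batch.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_image_attention(feats, seed=0):
    """feats: (B, D) global descriptors of B images in a batch.
    Returns (B, D) descriptors refined by attending across images."""
    B, D = feats.shape
    rng = np.random.default_rng(seed)
    # Hypothetical learned query/key/value projections; random here.
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    # (B, B) matrix of image-to-image attention weights.
    attn = softmax(Q @ K.T / np.sqrt(D))
    # Residual connection keeps the original per-image descriptor.
    return feats + attn @ V

batch = np.random.default_rng(1).standard_normal((4, 8))
out = cross_image_attention(batch)
print(out.shape)
```

Because the attention matrix is computed over the batch dimension rather than over spatial tokens, images taken at the same place under different conditions can directly inform one another's representations.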