Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC uses only geometric information and ignores other information that could be useful for reranking, e.g., local feature correlations and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named $R^{2}$Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether an image pair is from the same location. The whole pipeline is end-to-end trainable, and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, $R^{2}$Former significantly outperforms state-of-the-art methods on major VPR datasets while requiring much less inference time and memory. It also achieves state-of-the-art performance on the held-out MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show that vision transformer tokens are comparable to, and sometimes better than, CNN local features for local matching. The code is released at https://github.com/Jeff-Zilence/R2Former.
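The two-stage flow described above (global retrieval, then reranking of the top candidates) can be illustrated with a minimal sketch. This is not the $R^{2}$Former implementation: the function names, the dictionary layout with `"global"` and `"local"` keys, and the toy descriptors are all hypothetical, and the learned transformer reranker is replaced here by a simple mutual-nearest-neighbour count so that the example stays self-contained.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_global, db_globals, k):
    """Stage 1: indices of the k references closest to the query (global descriptors)."""
    order = sorted(range(len(db_globals)),
                   key=lambda i: cosine(query_global, db_globals[i]),
                   reverse=True)
    return order[:k]

def mutual_nn_score(query_locals, ref_locals):
    """Placeholder pair scorer (an assumption, standing in for a learned reranker):
    counts local features that are each other's nearest neighbour."""
    if not query_locals or not ref_locals:
        return 0
    q2r = [max(range(len(ref_locals)), key=lambda j: cosine(q, ref_locals[j]))
           for q in query_locals]
    r2q = [max(range(len(query_locals)), key=lambda i: cosine(r, query_locals[i]))
           for r in ref_locals]
    return sum(1 for i, j in enumerate(q2r) if r2q[j] == i)

def retrieve_and_rerank(query, database, k=2, pair_score=mutual_nn_score):
    """Stage 2: reorder the top-k candidates by a pairwise score."""
    candidates = retrieve(query["global"], [d["global"] for d in database], k)
    return sorted(candidates,
                  key=lambda i: pair_score(query["local"], database[i]["local"]),
                  reverse=True)

# Toy data (hypothetical): image 0 wins on the global descriptor,
# but image 1 has better local matches.
query = {"global": [1.0, 0.0], "local": [[1.0, 0.0], [0.0, 1.0]]}
database = [
    {"global": [1.0, 0.1], "local": [[1.0, 1.0], [1.0, 1.0]]},
    {"global": [1.0, 0.5], "local": [[1.0, 0.0], [0.0, 1.0]]},
    {"global": [0.0, 1.0], "local": [[0.0, 1.0]]},
]

initial = retrieve(query["global"], [d["global"] for d in database], k=2)
reranked = retrieve_and_rerank(query, database, k=2)
print(initial, reranked)
```

On this toy data, global retrieval ranks image 0 first, and the pairwise scorer moves image 1 to the top. Passing a different `pair_score` callable is where a learned model that also consumes attention values and xy coordinates would plug in.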