Visual place recognition (VPR) is a fundamental task for many applications, such as robot localization and augmented reality. Recently, hierarchical VPR methods have received considerable attention because they offer a good trade-off between accuracy and efficiency. They usually first use global features to retrieve candidate images, then verify the spatial consistency of matched local features to re-rank those candidates. However, the re-ranking stage typically relies on the RANSAC algorithm to fit a homography, which is time-consuming and non-differentiable. As a result, existing methods compromise by training the network only for global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits a homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels. The DHE network can also be trained jointly with the backbone network, helping the backbone extract features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method outperforms several state-of-the-art methods. It is also more than one order of magnitude faster than mainstream hierarchical VPR methods that use RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
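To make the geometric verification step concrete, the following is a minimal sketch of how a predicted homography could be scored against matched local features, and how the inlier count could drive re-ranking. It is an illustration under stated assumptions, not the released implementation: the function names, the 3-pixel threshold, the use of already-matched point pairs, and the mean-over-inliers form of the error are all hypothetical, since the abstract does not specify them. A trainable version would compute the same quantity with differentiable tensor operations.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Homography = Sequence[Sequence[float]]  # 3x3 matrix, row-major


def project(h: Homography, p: Point) -> Point:
    """Map a 2D point through a 3x3 homography (with homogeneous divide)."""
    x, y = p
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)


def reprojection_errors(
    h: Homography, src: Sequence[Point], dst: Sequence[Point]
) -> List[float]:
    """Euclidean distance between each projected source point and its match."""
    errors = []
    for p, q in zip(src, dst):
        u, v = project(h, p)
        errors.append(((u - q[0]) ** 2 + (v - q[1]) ** 2) ** 0.5)
    return errors


def inlier_stats(
    h: Homography,
    src: Sequence[Point],
    dst: Sequence[Point],
    threshold: float = 3.0,  # assumed pixel threshold, not from the paper
) -> Tuple[int, float]:
    """Return (inlier count, mean re-projection error over the inliers)."""
    inlier_errors = [e for e in reprojection_errors(h, src, dst) if e < threshold]
    if not inlier_errors:
        return 0, 0.0
    return len(inlier_errors), sum(inlier_errors) / len(inlier_errors)


def rerank(candidates, threshold: float = 3.0) -> List[str]:
    """Order candidate ids by descending inlier count.

    `candidates` is a list of (candidate_id, homography, src_points, dst_points).
    """
    scored = [
        (inlier_stats(h, src, dst, threshold)[0], cid)
        for cid, h, src, dst in candidates
    ]
    # sorted() is stable, so ties keep their original retrieval order.
    return [cid for _, cid in sorted(scored, key=lambda item: -item[0])]
```

In this sketch the mean inlier error plays the role of a training signal that needs no ground-truth homography, while the inlier count serves as the re-ranking score; both choices are assumptions made for illustration.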