Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, hierarchical VPR methods have received considerable attention due to their favorable trade-off between accuracy and efficiency. They typically first use global features to retrieve candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the re-ranking stage usually relies on the RANSAC algorithm to fit a homography, which is time-consuming and non-differentiable, forcing existing methods to train the network only for global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits a homography for fast and learnable geometric verification. Moreover, we design a loss based on the re-projection error of inliers to train the DHE network without additional homography labels; it can also be jointly trained with the backbone network, helping the backbone extract features that are better suited to local matching. Extensive experiments on benchmark datasets show that our method outperforms several state-of-the-art methods and is more than one order of magnitude faster than mainstream hierarchical VPR methods that use RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
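The two-stage pipeline described above (global retrieval, then geometric verification by re-projection error under a homography) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the cosine-similarity retrieval, and the fixed pixel threshold are assumptions, and the homography `H` is taken as given here, whereas the paper predicts it with a transformer-based DHE network instead of fitting it with RANSAC.

```python
import numpy as np

def retrieve_candidates(query_desc, db_descs, k=3):
    # Stage 1 (hypothetical): rank database images by cosine similarity
    # between global descriptors and keep the top-k candidates.
    sims = db_descs @ query_desc / (
        np.linalg.norm(db_descs, axis=1) * np.linalg.norm(query_desc) + 1e-8)
    return np.argsort(-sims)[:k]

def inlier_reprojection_score(pts_q, pts_c, H, thresh=3.0):
    # Stage 2 (hypothetical): warp query keypoints with homography H and
    # count matched keypoints whose re-projection error is below `thresh`
    # pixels. Candidates are re-ranked by this inlier count; in the paper
    # H is predicted by the learnable DHE network, making this step
    # differentiable and fast.
    pts_h = np.hstack([pts_q, np.ones((len(pts_q), 1))])  # homogeneous coords
    proj = pts_h @ H.T
    proj = proj[:, :2] / proj[:, 2:3]                     # back to Euclidean
    errs = np.linalg.norm(proj - pts_c, axis=1)
    return int((errs < thresh).sum())
```

For example, with an identity homography and perfectly aligned keypoints, every match counts as an inlier; candidates with higher inlier counts would be ranked first.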