Recent studies show that vision models pre-trained on generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Because the training objectives and data of model pre-training differ inherently from those of VPR, bridging this gap and fully unleashing the capability of pre-trained models for VPR remains a key open issue. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method that achieves both global and local adaptation efficiently, in which only lightweight adapters are tuned and the pre-trained model itself is left unchanged. In addition, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures that proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of two-stage VPR methods with RANSAC-based spatial verification. Our method ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.
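To illustrate the kind of local matching the abstract refers to, the following is a minimal sketch of mutual nearest neighbor matching between two sets of dense local features, with the match count used as a re-ranking score in place of spatial verification. It is not the released implementation: the function names `mutual_nn_matches` and `rerank`, the use of cosine similarity, and the use of the raw match count as the score are all assumptions made for illustration. The training loss itself is not shown.

```python
import numpy as np

def mutual_nn_matches(feats_a, feats_b):
    """Return index pairs (i, j) where feats_a[i] and feats_b[j] are each
    other's nearest neighbor. Inputs have shape (N, D) with non-zero rows."""
    # L2-normalize rows so the dot product is cosine similarity (assumed metric).
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                      # shape (Na, Nb)
    nn_ab = sim.argmax(axis=1)         # best match in b for each feature in a
    nn_ba = sim.argmax(axis=0)         # best match in a for each feature in b
    idx_a = np.arange(sim.shape[0])
    mutual = nn_ba[nn_ab] == idx_a     # keep pairs that agree in both directions
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)

def rerank(query_feats, candidate_feats_list):
    """Order candidates by number of mutual matches with the query (descending).
    Hypothetical scoring rule: more mutual matches means a better candidate."""
    counts = [len(mutual_nn_matches(query_feats, c)) for c in candidate_feats_list]
    order = sorted(range(len(counts)), key=lambda k: -counts[k])
    return order, counts

if __name__ == "__main__":
    query = np.eye(3)                  # three toy local descriptors
    same_place = np.eye(3)[[2, 0, 1]]  # same descriptors, shuffled
    other_place = np.ones((3, 3))      # uninformative descriptors
    print(rerank(query, [other_place, same_place]))
```

Because matching reduces to one similarity matrix and two `argmax` calls, re-ranking a candidate needs no iterative geometric model fitting, which is consistent with the runtime advantage over RANSAC-based verification claimed above.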