The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on task-specific data. In this paper, we propose a simple yet powerful approach to better exploit the potential of a foundation model for VPR. We first demonstrate that features extracted from self-attention layers can serve as a powerful re-ranker for VPR. Utilizing these features in a zero-shot manner, our method surpasses previous zero-shot approaches and achieves results competitive with supervised methods across multiple datasets. We then show that a single-stage method leveraging internal ViT layers for pooling can generate global features that achieve state-of-the-art results, even when reduced to a dimensionality as low as 128D. Incorporating our local foundation features for re-ranking widens this gap further. Our approach also demonstrates remarkable robustness and generalization, achieving state-of-the-art results by a significant margin in challenging scenarios involving occlusion, day-night variations, and seasonal changes.
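To make the re-ranking idea concrete, the following is a minimal sketch of how dense local features (e.g., keys or values taken from a ViT self-attention layer of a model like DINOv2) could re-rank retrieval candidates by counting mutual nearest-neighbor matches between patch features. The matching score and the use of mutual nearest neighbors are illustrative assumptions here, not the paper's exact formulation; features are synthetic stand-ins for extracted patch descriptors.

```python
import numpy as np

def mutual_nn_matches(query_feats: np.ndarray, db_feats: np.ndarray) -> int:
    """Count mutual nearest-neighbor matches between two sets of
    L2-normalized local (patch) features, shapes (Nq, D) and (Nd, D)."""
    sim = query_feats @ db_feats.T          # cosine similarities
    nn_q = sim.argmax(axis=1)               # best db patch for each query patch
    nn_d = sim.argmax(axis=0)               # best query patch for each db patch
    # A match (i, j) is mutual if i -> j and j -> i.
    return int(np.sum(nn_d[nn_q] == np.arange(len(query_feats))))

def rerank(query_feats: np.ndarray, candidates: list[np.ndarray]) -> list[int]:
    """Return candidate indices sorted by descending match count."""
    scores = [mutual_nn_matches(query_feats, c) for c in candidates]
    return list(np.argsort(scores)[::-1])

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Synthetic demo: one candidate shares most patches with the query (same
# place), the other is unrelated; re-ranking should prefer the former.
rng = np.random.default_rng(0)
query = normalize(rng.normal(size=(50, 16)))
same_place = normalize(np.vstack([
    query[:30] + 0.01 * rng.normal(size=(30, 16)),  # overlapping content
    rng.normal(size=(20, 16)),                      # unrelated clutter
]))
other_place = normalize(rng.normal(size=(50, 16)))
order = rerank(query, [other_place, same_place])
```

In a real pipeline, a fast global-descriptor search would first retrieve the top-k candidates, and this local matching step would only reorder that short list, keeping the cost of dense matching bounded.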