The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on task-specific data. In this paper, we propose a simple yet powerful approach to better exploit the potential of a foundation model for VPR. We first demonstrate that features extracted from self-attention layers can serve as a powerful re-ranker for VPR. Utilizing these features in a zero-shot manner, our method surpasses previous zero-shot approaches and achieves results competitive with supervised methods across multiple datasets. We then show that a single-stage method leveraging internal ViT layers for pooling can generate global features that achieve state-of-the-art results, even when reduced to a dimensionality as low as 128D. Incorporating our local foundation features for re-ranking widens this gap further. Our approach also demonstrates remarkable robustness and generalization, achieving state-of-the-art results by a significant margin in challenging scenarios involving occlusion, day-night variations, and seasonal changes.
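To make the re-ranking idea concrete, the following is a minimal sketch of how dense local features (e.g., keys or values taken from a ViT self-attention layer of a model like DINOv2) could re-rank retrieval candidates by counting mutual nearest-neighbor matches between patch features. The matching score and the use of mutual nearest neighbors are illustrative assumptions here, not the paper's exact formulation; features are synthetic stand-ins for extracted patch descriptors.

```python
import numpy as np

def mutual_nn_matches(query_feats: np.ndarray, db_feats: np.ndarray) -> int:
    """Count mutual nearest-neighbor matches between two sets of
    L2-normalized local (patch) features, shapes (Nq, D) and (Nd, D)."""
    sim = query_feats @ db_feats.T          # cosine similarities
    nn_q = sim.argmax(axis=1)               # best db patch for each query patch
    nn_d = sim.argmax(axis=0)               # best query patch for each db patch
    # A match (i, j) is mutual if i -> j and j -> i.
    return int(np.sum(nn_d[nn_q] == np.arange(len(query_feats))))

def rerank(query_feats: np.ndarray, candidates: list[np.ndarray]) -> list[int]:
    """Return candidate indices sorted by descending match count."""
    scores = [mutual_nn_matches(query_feats, c) for c in candidates]
    return list(np.argsort(scores)[::-1])

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Synthetic demo: one candidate shares most patches with the query (same
# place), the other is unrelated; re-ranking should prefer the former.
rng = np.random.default_rng(0)
query = normalize(rng.normal(size=(50, 16)))
same_place = normalize(np.vstack([
    query[:30] + 0.01 * rng.normal(size=(30, 16)),  # overlapping content
    rng.normal(size=(20, 16)),                      # unrelated clutter
]))
other_place = normalize(rng.normal(size=(50, 16)))
order = rerank(query, [other_place, same_place])
```

In a real pipeline, a fast global-descriptor search would first retrieve the top-k candidates, and this local matching step would only reorder that short list, keeping the cost of dense matching bounded.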