Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform "blind" localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at https://github.com/oferidan1/LaVPR.
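The cross-modal baseline combines LoRA fine-tuning with the Multi-Similarity loss (Wang et al., 2019). As a rough, self-contained illustration of that loss (not the paper's exact implementation; the hyperparameter values `alpha`, `beta`, `lam`, `eps` below are common defaults chosen for illustration), it weights hard positive and hard negative pairs of a similarity matrix after a simple pair-mining step:

```python
import numpy as np

def multi_similarity_loss(sim, labels, alpha=2.0, beta=50.0, lam=0.5, eps=0.1):
    """Illustrative Multi-Similarity loss on a similarity matrix.

    sim:    (n, n) similarity matrix (e.g. image-text cosine similarities).
    labels: (n,) place IDs; entries with equal labels form positive pairs.
    Hyperparameters follow common defaults, not LaVPR's reported settings.
    """
    n = sim.shape[0]
    losses = []
    for i in range(n):
        neg_mask = labels != labels[i]
        pos_mask = labels == labels[i]
        pos_mask[i] = False                      # exclude the self-pair
        pos, neg = sim[i][pos_mask], sim[i][neg_mask]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # Pair mining: keep positives harder than the hardest negative
        # (minus a margin) and negatives harder than the hardest positive.
        hard_pos = pos[pos < neg.max() + eps]
        hard_neg = neg[neg > pos.min() - eps]
        if len(hard_pos) == 0 or len(hard_neg) == 0:
            continue
        pos_term = np.log1p(np.sum(np.exp(-alpha * (hard_pos - lam)))) / alpha
        neg_term = np.log1p(np.sum(np.exp(beta * (hard_neg - lam)))) / beta
        losses.append(pos_term + neg_term)
    return float(np.mean(losses)) if losses else 0.0

# Toy example: two places, two embeddings each, with overlapping similarities
labels = np.array([0, 0, 1, 1])
sim = np.array([[1.0, 0.5, 0.6, 0.4],
                [0.5, 1.0, 0.4, 0.3],
                [0.6, 0.4, 1.0, 0.5],
                [0.4, 0.3, 0.5, 1.0]])
loss = multi_similarity_loss(sim, labels)
```

In a LoRA setup, a loss of this form would be back-propagated only through the low-rank adapter weights of the frozen vision-language encoders; the mining step is what distinguishes it from a standard contrastive (e.g. InfoNCE) objective.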