We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, increasing data diversity and model complexity incurs substantial computational cost during both training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but it often targets multi-sensor setups or relies on heuristics, yielding limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models each place from its multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over the single-reference setting and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.
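To make the core idea of the abstract concrete, the minimal sketch below illustrates one way projection-based residual matching over a per-place basis could look, assuming a truncated SVD as the matrix decomposition and NumPy arrays as descriptors. The function names (`place_basis`, `residual_score`), the rank parameter, and the choice of SVD are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hedged sketch (not necessarily the authors' exact method): stack a place's
# multi-reference descriptors as columns, factorise the matrix to obtain an
# orthonormal basis for that place, and score a query descriptor by its
# projection residual onto the basis.

def place_basis(ref_descriptors, rank):
    """ref_descriptors: (d, m) matrix of m reference descriptors for one place."""
    U, _, _ = np.linalg.svd(ref_descriptors, full_matrices=False)
    return U[:, :rank]  # (d, rank) orthonormal basis spanning the place subspace

def residual_score(query, basis):
    """Smaller residual means the query is better explained by the place's basis."""
    projection = basis @ (basis.T @ query)
    return np.linalg.norm(query - projection)

# Toy usage: 512-D descriptors, 3 references per place, rank-2 basis (illustrative values).
rng = np.random.default_rng(0)
refs = rng.standard_normal((512, 3))
query = refs[:, 0] + 0.05 * rng.standard_normal(512)
basis = place_basis(refs, rank=2)
print(residual_score(query, basis))  # near zero for a matching place
```

In this sketch, matching a query against a database reduces to computing the residual for each place's basis and returning the place with the smallest residual, which requires no training and is agnostic to the descriptor extractor used.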