Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images of different places, relying solely on visual cues. State-of-the-art pipelines focus on aggregating features extracted from a deep backbone to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft assignment of local features to clusters as an optimal transport problem. SALAD considers both feature-to-cluster and cluster-to-feature relations, and additionally introduces a 'dustbin' cluster designed to selectively discard features deemed non-informative, enhancing overall descriptor quality. Moreover, we leverage and fine-tune DINOv2 as the backbone, which provides more descriptive local features and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines on public VPR datasets, but also outperforms two-stage methods that add a re-ranking step at significantly higher computational cost. Code and models are available at https://github.com/serizba/salad.
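The core idea, a Sinkhorn-normalized soft assignment with a dustbin column followed by VLAD-style aggregation, can be sketched as below. This is a minimal illustration under assumed tensor shapes and a fixed dustbin logit, not the authors' implementation (the linked repository has that); function names here are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along an axis (keeps dims)."""
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def sinkhorn_assignment(scores, dustbin_logit=0.0, n_iters=5):
    """Soft-assign N local features to K clusters plus a dustbin.

    scores: (N, K) feature-to-cluster similarity logits.
    Alternating column and row normalizations in log-space approximate
    the optimal-transport assignment; the appended dustbin column lets
    non-informative features be discarded. The constant dustbin_logit
    is a simplification for illustration.
    """
    n, _ = scores.shape
    log_p = np.concatenate([scores, np.full((n, 1), dustbin_logit)], axis=1)
    for _ in range(n_iters):
        log_p = log_p - logsumexp(log_p, axis=0)  # cluster-to-feature
        log_p = log_p - logsumexp(log_p, axis=1)  # feature-to-cluster
    return np.exp(log_p)  # (N, K+1); each row sums to 1

def aggregate(features, assignment):
    """VLAD-style aggregation: weight each feature by its cluster
    assignment, drop the dustbin column, flatten, L2-normalize."""
    weights = assignment[:, :-1]        # (N, K), dustbin discarded
    desc = weights.T @ features         # (K, D) per-cluster sums
    desc = desc.flatten()
    return desc / np.linalg.norm(desc)
```

Because the last Sinkhorn step is a row normalization, each feature distributes exactly unit mass across the K clusters and the dustbin; mass routed to the dustbin simply never reaches the global descriptor.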