Visual Place Recognition is a task that aims to predict the place of an image (called the query) based solely on its visual features. This is typically done through image retrieval, where the query is matched to the most similar images from a large database of geotagged photos, using learned global descriptors. A major challenge in this task is recognizing places seen from different viewpoints. To address this challenge, we propose a new method, called EigenPlaces, which trains a neural network on images from different points of view and thereby embeds viewpoint robustness into the learned global descriptors. The underlying idea is to cluster the training data so as to explicitly present the model with different views of the same points of interest. These points of interest are selected without the need for extra supervision. We then present experiments on the most comprehensive set of datasets in the literature, finding that EigenPlaces outperforms the previous state of the art on the majority of datasets, while requiring 60\% less GPU memory for training and using 50\% smaller descriptors. The code and trained models for EigenPlaces are available at {\small{\url{https://github.com/gmberton/EigenPlaces}}}, while results with any other baseline can be computed with the codebase at {\small{\url{https://github.com/gmberton/auto_VPR}}}.
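To make the retrieval step concrete, the following is a minimal sketch of matching a query descriptor against a database of geotagged descriptors. It is not the EigenPlaces implementation: the function names (`l2_normalize`, `retrieve`), the use of cosine similarity, the three-dimensional toy descriptors, and the coordinates are all assumptions chosen for illustration. In practice the descriptors would come from a trained network and have far more dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def retrieve(query_desc, database, top_k=1):
    """Return the top_k (similarity, geotag) pairs, most similar first.

    `database` is a list of (descriptor, geotag) pairs; this layout is a
    hypothetical one used only for this example.
    """
    q = l2_normalize(query_desc)
    scored = []
    for desc, geotag in database:
        d = l2_normalize(desc)
        # Cosine similarity: dot product of unit-length descriptors.
        similarity = sum(a * b for a, b in zip(q, d))
        scored.append((similarity, geotag))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy database: made-up descriptors paired with made-up (lat, lon) geotags.
database = [
    ([1.0, 0.0, 0.0], (45.06, 7.66)),
    ([0.0, 1.0, 0.0], (41.90, 12.49)),
    ([0.7, 0.7, 0.0], (43.77, 11.25)),
]
query = [0.9, 0.1, 0.0]

best = retrieve(query, database, top_k=1)
predicted_location = best[0][1]  # geotag of the most similar database image
```

The geotag of the best match serves as the predicted place of the query. A linear scan is used here only for clarity; with a large database, an approximate nearest-neighbor index would normally replace it.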