Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images, taken at some unknown location, to a set of geo-tagged reference images. Existing methods, devoted to semantic feature representation, have evolved toward robustness to a wide range of differences between query and reference images, including illumination and viewpoint changes as well as scale and seasonal variations. However, practical visual geolocalization approaches need to be robust under appearance changes and extreme viewpoint variations while providing accurate global location estimates. Inspired by curriculum design, in which humans learn general knowledge first and then delve into professional expertise, we first recognize the semantic scene and then measure the geometric structure. Our approach, termed CurriculumLoc, combines a carefully designed multi-stage refinement pipeline with a novel keypoint detection and description method that provides global semantic awareness and local geometric verification. We rerank candidates and solve a particular cross-domain perspective-n-point (PnP) problem based on these keypoints and their corresponding descriptors, so that the position is refined incrementally. Extensive experimental results on our collected dataset, TerraTrack, and a benchmark dataset, ALTO, demonstrate that our approach achieves the aforementioned desirable characteristics of a practical visual geolocalization solution. Additionally, we achieve new high recall@1 scores of 62.6% and 94.5% on ALTO under two different distance metrics, respectively. The dataset, code, and trained models are publicly available at https://github.com/npupilab/CurriculumLoc.
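The "recognize the semantic scene first, then measure geometric structure" idea can be illustrated with a generic retrieve-then-rerank sketch. This is not the CurriculumLoc implementation: the function names (`retrieve_top_k`, `mutual_match_count`, `rerank`), the use of cosine similarity on global descriptors, and the use of a mutual nearest-neighbour match count as the reranking score are all assumptions chosen for illustration, and the final PnP pose refinement stage is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_top_k(query_global, ref_globals, k):
    """Coarse stage (assumed): rank references by cosine similarity
    between global descriptors and keep the k best indices."""
    sims = l2_normalize(ref_globals) @ l2_normalize(query_global)
    return np.argsort(-sims)[:k]

def mutual_match_count(desc_a, desc_b):
    """Count mutual nearest-neighbour matches between two sets of
    local descriptors (rows are descriptors)."""
    sim = l2_normalize(desc_a) @ l2_normalize(desc_b).T
    nn_ab = sim.argmax(axis=1)  # best match in b for each row of a
    nn_ba = sim.argmax(axis=0)  # best match in a for each row of b
    return int(np.sum(nn_ba[nn_ab] == np.arange(len(desc_a))))

def rerank(query_local, ref_locals, candidates):
    """Fine stage (assumed): reorder candidates by how many local
    descriptors mutually match the query, highest count first."""
    scores = [mutual_match_count(query_local, ref_locals[i]) for i in candidates]
    order = np.argsort(-np.asarray(scores), kind="stable")
    return [int(candidates[j]) for j in order]

# Toy data: three references with 2-D global and 3-D local descriptors.
ref_globals = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query_global = np.array([1.0, 0.0])
query_local = np.eye(3)
ref_locals = [
    np.array([[1.0, 0.0, 0.0]] * 3),  # globally closest, locally ambiguous
    np.eye(3),                        # locally consistent with the query
    np.eye(3)[::-1].copy(),
]

candidates = retrieve_top_k(query_global, ref_globals, k=2)
reranked = rerank(query_local, ref_locals, candidates)
```

In the toy data, the reference that is closest in global descriptor space has locally ambiguous descriptors, so the local check moves the second candidate ahead of it. In a full pipeline, the matched keypoints of the top-ranked candidate would then feed a PnP solver to refine the position.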