Cross-View Geo-Localization (CVGL) estimates the location of a ground image by matching it to a geo-tagged aerial image in a database. Recent work has made substantial progress on CVGL benchmarks. However, existing methods still perform poorly in cross-area evaluation, in which the training and testing data are captured from completely distinct areas. We attribute this deficiency to the models' inability to extract the geometric layout of visual features and to their overfitting to low-level details. Our preliminary work introduced a Geometric Layout Extractor (GLE) to capture the geometric layout from input features. However, that GLE does not fully exploit the information in the input features. In this work, we propose GeoDTR+, which uses an enhanced GLE module that better models the correlations among visual features. To fully exploit the layout simulation and semantic augmentation (LS) techniques from our preliminary work, we further propose Contrastive Hard Samples Generation (CHSG) to facilitate model training. Extensive experiments show that GeoDTR+ achieves state-of-the-art (SOTA) results in cross-area evaluation on CVUSA, CVACT, and VIGOR by a large margin ($16.44\%$, $22.71\%$, and $17.02\%$, respectively, without polar transformation) while keeping same-area performance comparable to the existing SOTA. We also provide detailed analyses of GeoDTR+.