Simple, Effective and General: A New Backbone for Cross-view Image Geo-localization

In this work, we address an important but underexplored problem: designing a simple yet effective backbone specifically for the cross-view geo-localization task. Existing methods for cross-view geo-localization are frequently characterized by 1) complicated methodologies, 2) GPU-intensive computations, and 3) a stringent assumption that aerial and ground images are center- or orientation-aligned. To address these three challenges in cross-view image matching, we propose a new backbone network, named Simple Attention-based Image Geo-localization network (SAIG). The proposed SAIG effectively represents long-range interactions among patches as well as cross-view correspondence with multi-head self-attention layers. The "narrow-deep" architecture of our SAIG improves feature richness without degrading performance, while its shallow and effective convolutional stem preserves locality, eliminating the loss of boundary information caused by patchifying the image. Our SAIG achieves state-of-the-art results on cross-view geo-localization while being far simpler than previous works. Furthermore, with only 15.9% of the model parameters and half of the output dimension compared to the state-of-the-art, SAIG adapts well across multiple cross-view datasets without employing any well-designed feature aggregation modules or feature alignment algorithms. In addition, our SAIG attains competitive scores on image retrieval benchmarks, further demonstrating its generalizability. As a backbone network, our SAIG is both easy to follow and computationally lightweight, which is meaningful in practical scenarios. Moreover, we propose a simple Spatial-Mixed feature aggregation moDule (SMD) that can mix and project spatial information into a low-dimensional space to generate feature descriptors... (The code is available at https://github.com/yanghongji2007/SAIG)
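
To make the aggregation idea in the abstract concrete, the following is a minimal sketch of one way to "mix and project spatial information into a low-dimensional space". It is not the authors' SMD implementation (see their repository for that). The function name `spatial_mix_aggregate`, the two-layer mixing step with a residual connection, the GELU activation, the random weights, and all sizes are assumptions chosen only for illustration. It uses plain Python lists so that it runs without any dependencies.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap the rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical stand-in for learned weights: Gaussian random entries."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def spatial_mix_aggregate(tokens, w_mix_in, w_mix_out, w_proj):
    """Turn N patch tokens of width C into one flat descriptor of length C * k.

    tokens:    N x C backbone outputs (one row per spatial position)
    w_mix_in:  N x H, w_mix_out: H x N  (assumed spatial-mixing weights)
    w_proj:    N x k                    (assumed spatial projection, k << N)
    """
    x = transpose(tokens)  # C x N: each row is one channel across all positions
    hidden = [[gelu(v) for v in row] for row in matmul(x, w_mix_in)]  # C x H
    mixed = matmul(hidden, w_mix_out)  # C x N
    # Residual connection (an assumption of this sketch).
    mixed = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, mixed)]
    projected = matmul(mixed, w_proj)  # C x k: spatial axis reduced from N to k
    flat = [v for row in projected for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0  # avoid division by zero
    return [v / norm for v in flat]  # L2-normalised descriptor


rng = random.Random(0)
num_tokens, channels, hidden_dim, k = 16, 8, 32, 4
tokens = random_matrix(num_tokens, channels, rng, scale=1.0)
descriptor = spatial_mix_aggregate(
    tokens,
    random_matrix(num_tokens, hidden_dim, rng),
    random_matrix(hidden_dim, num_tokens, rng),
    random_matrix(num_tokens, k, rng),
)
print(len(descriptor))  # 32, i.e. channels * k
```

In a retrieval setting, a descriptor of this kind would be computed for each ground and aerial image and compared by cosine similarity. Because the vectors are L2-normalised, that comparison reduces to a dot product.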
