Cross-View Geo-Localization (CVGL) aims to determine the location of a drone image by retrieving the most similar GPS-tagged satellite image. However, the imaging gap between platforms is often significant and viewpoint variations are substantial, which limits the ability of existing methods to associate cross-view features and extract consistent, invariant representations. Moreover, existing methods often overlook the increased computational and storage costs incurred when improving model performance. To address these limitations, we propose a lightweight enhanced alignment network, called the Multi-Level Embedding and Alignment Network (MEAN). MEAN employs a progressive multi-level enhancement strategy, global-to-local associations, and cross-domain alignment, enabling feature communication across levels. This allows MEAN to effectively connect features at different levels and to learn robust cross-view consistent mappings and modality-invariant features. Furthermore, MEAN adopts a shallow backbone combined with a lightweight branch design, substantially reducing parameter count and computational complexity. Experimental results on the University-1652 and SUES-200 datasets show that, compared to state-of-the-art models, MEAN reduces parameter count by 62.17% and computational complexity by 70.99% while achieving competitive or even superior performance. The code will be released soon.
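The retrieval formulation stated above can be sketched as ranking a satellite-image gallery by embedding similarity to a drone-image embedding. This is a minimal, hypothetical illustration of the task setup only (function names and the cosine-similarity metric are assumptions for illustration), not the MEAN architecture itself:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(drone_emb, satellite_embs, k=1):
    """Rank the GPS-tagged satellite gallery by similarity to the
    drone embedding and return the indices of the top-k matches.
    Hypothetical helper sketching the CVGL retrieval step."""
    ranked = sorted(range(len(satellite_embs)),
                    key=lambda i: cosine(drone_emb, satellite_embs[i]),
                    reverse=True)
    return ranked[:k]

# Toy gallery: the location tag of the best-ranked satellite
# embedding is taken as the drone image's predicted location.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.05]
best = retrieve_top_k(query, gallery, k=1)[0]
```

In practice the quality of this retrieval hinges entirely on the learned embedding space; MEAN's contribution is producing cross-view consistent, modality-invariant embeddings at low parameter and compute cost.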