Cross-view object geo-localization has recently gained attention due to its potential applications. Existing methods aim to capture the spatial dependencies of query objects between different views through attention mechanisms, obtaining spatial relationship feature maps that are then used to predict object locations. Although promising, these approaches neither transfer information effectively between views nor further refine the spatial relationship feature maps. As a result, the model erroneously attends to irrelevant edge noise, degrading localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing edge noise unrelated to the query object. Furthermore, we integrate a Multi-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we create a new dataset, G2D, for the "Ground-to-Drone" localization task, enriching existing datasets and filling the gap in this setting. Extensive experiments on the CVOGL and G2D datasets demonstrate that our method achieves high localization accuracy, surpassing the current state of the art.
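The abstract gives no implementation details for the two modules. The NumPy sketch below illustrates the two underlying ideas only: iterative cross-view attention between two feature sets (the CVCAM idea) and multi-scale spatial pooling fused into a spatial gate (a stand-in for the MHSAM's multi-size convolutions). All function names, shapes, iteration counts, and kernel sizes are illustrative assumptions, not the authors' actual architecture, and no learned weights are used.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat, kv_feat):
    # Scaled dot-product attention: q_feat (Nq, d) attends to kv_feat (Nk, d).
    scores = q_feat @ kv_feat.T / np.sqrt(q_feat.shape[1])
    return softmax(scores, axis=-1) @ kv_feat

def cvcam_sketch(ground, drone, n_iters=2):
    # Iteratively exchange context between the two views (hypothetical
    # simplification of the described cross-view cross-attention).
    g, d = ground, drone
    for _ in range(n_iters):
        g_new = g + cross_attention(g, d)  # ground view attends to drone view
        d_new = d + cross_attention(d, g)  # drone view attends to ground view
        g, d = g_new, d_new
    return g, d

def multi_scale_spatial_gate(fmap, kernel_sizes=(3, 5, 7)):
    # fmap: (H, W, C). Average-pool a channel-mean saliency map at several
    # window sizes, fuse the results, and apply a sigmoid spatial gate
    # (a weight-free stand-in for learned multi-size convolutions).
    H, W, _ = fmap.shape
    pooled = fmap.mean(axis=-1)  # (H, W) channel-mean saliency
    maps = []
    for k in kernel_sizes:
        pad = k // 2
        padded = np.pad(pooled, pad, mode="edge")
        out = np.zeros_like(pooled)
        for i in range(H):
            for j in range(W):
                out[i, j] = padded[i:i + k, j:j + k].mean()
        maps.append(out)
    attn = 1.0 / (1.0 + np.exp(-np.mean(maps, axis=0)))  # sigmoid fusion
    return fmap * attn[..., None]
```

In this toy form, the residual additions in `cvcam_sketch` mirror the "continuous exchange" of context between views, and the gate in `multi_scale_spatial_gate` downweights locations with low multi-scale saliency, which is the role the paper assigns to suppressing edge noise.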