This paper addresses Unmanned Aerial Vehicle (UAV) visual geo-localization, which aims to match images of the same geographic target taken from different platforms, i.e., UAVs and satellites. The key to accurate UAV-satellite image matching is extracting visual features that are robust to viewpoint changes, scale variations, and rotations. Recent work has shown that part matching is crucial for UAV visual geo-localization, since part-level representations capture image details and help in understanding the semantic information of a scene. However, the importance of preserving semantic characteristics in part-level representations has not been well discussed. In this paper, we introduce a transformer-based adaptive semantic aggregation method that regards parts as the most representative semantics in an image. Correlations between image patches and different parts are learned from the transformer's feature map. Our method then decomposes each part-level feature into an adaptive sum of all patch features. In this way, the learned parts are encouraged to focus on patches with typical semantics. Extensive experiments on the University-1652 dataset show that our method outperforms existing approaches.
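To make the idea of an "adaptive sum of all patch features" concrete, the following is a minimal sketch rather than the method itself. The abstract does not say how patch-to-part correlations are computed, so the sketch assumes a dot-product similarity against hypothetical part prototypes, normalized with a softmax over patches. The names `aggregate_parts`, `part_queries`, and `temperature` are illustrative and do not come from the paper.

```python
import numpy as np

def aggregate_parts(patch_features, part_queries, temperature=1.0):
    """Illustrative part aggregation (an assumption, not the paper's exact rule).

    patch_features: (N, D) array, one row per image patch from the feature map.
    part_queries:   (K, D) array of hypothetical part prototypes.
    Returns part features of shape (K, D) and the weights of shape (K, N).
    """
    # Assumed correlation: scaled dot product between each part and each patch.
    scores = part_queries @ patch_features.T / temperature  # (K, N)
    # Numerically stable softmax over the patch axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each part feature is a weighted sum of all patch features.
    part_features = weights @ patch_features  # (K, D)
    return part_features, weights

# Toy input: 16 patches with 8-dimensional features, 4 parts.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
queries = rng.normal(size=(4, 8))
parts, weights = aggregate_parts(patches, queries)
```

Because the weights for each part are non-negative and sum to one, every part feature is a convex combination of the patch features. Under this assumed formulation, lowering `temperature` concentrates each part on fewer patches, which is one way to picture a part focusing on patches with typical semantics.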