Cross-view geo-localisation is still a challenging task in which additional modules, specific pre-processing or zooming strategies are necessary to determine accurate positions of images. Since different views have different geometries, pre-processing such as polar transformation helps to merge them. However, this results in distorted images, which then have to be rectified. Adding hard negatives to the training batch could improve overall performance, but the default loss functions in geo-localisation make it difficult to include them. In this article, we present a simplified but effective architecture based on contrastive learning with a symmetric InfoNCE loss that outperforms current state-of-the-art results. Our framework consists of a narrow training pipeline that eliminates the need for aggregation modules, avoids further pre-processing steps and even increases the generalisation capability of the model to unknown regions. We introduce two sampling strategies for hard negatives. The first explicitly exploits geographically neighbouring locations to provide a good starting point. The second leverages the visual similarity between image embeddings to mine hard negative samples. Our work shows excellent performance on common cross-view datasets such as CVUSA, CVACT, University-1652 and VIGOR. A comparison between cross-area and same-area settings demonstrates the good generalisation capability of our model.
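To make the symmetric InfoNCE objective concrete, the following is a minimal sketch in plain Python rather than the training code used for the experiments. It assumes that row i of `ground` and row i of `aerial` are embeddings of the same location, that similarity is the cosine similarity, and that the temperature is a fixed hypothetical value of 0.1; the function names are illustrative only.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(ground, aerial, temperature=0.1):
    """Average of the ground-to-aerial and aerial-to-ground InfoNCE terms.

    Assumption: ground[i] and aerial[i] form the positive pair; every other
    aerial[j] in the batch acts as a negative for ground[i], and vice versa.
    """
    g = [l2_normalise(v) for v in ground]
    a = [l2_normalise(v) for v in aerial]
    # logits[i][j] = cosine similarity of ground i and aerial j, scaled by temperature
    logits = [[sum(x * y for x, y in zip(gi, aj)) / temperature for aj in a] for gi in g]
    logits_t = [list(col) for col in zip(*logits)]  # aerial-to-ground direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

Because every other sample in the batch serves as a negative, a loss of this form can use hard negatives simply by placing them in the same batch, without constructing explicit triplets.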
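The two hard-negative sampling strategies can be illustrated in a similar way. The sketch below is an assumption about how such sampling could be set up, not a description of the actual pipeline: it ranks candidates by haversine distance for the geographic strategy and by cosine similarity of embeddings for the visual strategy. The helper names, the `(lat, lon)` input format and the parameter `k` are hypothetical.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def top_k_neighbours(items, score, k):
    """For each index i, return the k other indices j with the highest score(items[i], items[j])."""
    result = []
    for i, item in enumerate(items):
        others = [j for j in range(len(items)) if j != i]
        others.sort(key=lambda j: score(item, items[j]), reverse=True)
        result.append(others[:k])
    return result


def geo_hard_negatives(coords, k):
    """Strategy 1 (sketch): the k geographically closest other locations."""
    return top_k_neighbours(coords, lambda p, q: -haversine_km(p, q), k)


def visual_hard_negatives(embeddings, k):
    """Strategy 2 (sketch): the k other samples whose embeddings look most similar."""
    return top_k_neighbours(embeddings, cosine_similarity, k)
```

In a setup like this, the geographic variant needs no trained model and can therefore be used from the first epoch, whereas the visual variant depends on embeddings from a model that has already been trained for some time.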