Window-to-Window BEV Representation Learning for Limited FoV Cross-View Geo-localization

Cross-view geo-localization faces significant challenges due to large perspective changes, especially when the ground-view query image has a limited field of view (FoV) and unknown orientation. To bridge the cross-view domain gap, we explore, for the first time, learning a bird's-eye-view (BEV) representation directly from the ground query image. However, the unknown orientation between ground and aerial images, combined with the absence of camera parameters, leads to ambiguity between BEV queries and ground references. To tackle this challenge, we propose a novel Window-to-Window BEV representation learning method, termed W2W-BEV, which adaptively matches BEV queries to ground references at window scale. Specifically, predefined BEV embeddings and extracted ground features are segmented into a fixed number of windows, and then the most similar ground window is chosen for each BEV feature based on a context-aware window matching strategy. Subsequently, cross-attention is performed between the matched BEV and ground windows to learn a robust BEV representation. Additionally, we use ground features along with predicted depth information to initialize the BEV embeddings, which helps learn more powerful BEV representations. Extensive experimental results on benchmark datasets demonstrate the significant superiority of W2W-BEV over previous state-of-the-art methods under the challenging conditions of unknown orientation and limited FoV. Specifically, on the CVUSA dataset with a limited FoV of 90° and unknown orientation, W2W-BEV achieves a significant improvement in R@1 accuracy from 47.24% to 64.73% (+17.49%).
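The window-level matching and cross-attention steps described above can be illustrated with a minimal NumPy sketch. This is not the W2W-BEV implementation: the abstract does not specify the context-aware matching strategy, so the sketch assumes mean-pooled window descriptors compared by cosine similarity, treats both BEV embeddings and ground features as flat token sequences, and uses single-head attention without learned projections. The function names `split_windows`, `match_windows`, and `window_cross_attention` are hypothetical.

```python
import numpy as np


def split_windows(x, num_windows):
    """Split a (tokens, dim) array into (num_windows, tokens_per_window, dim)."""
    tokens, dim = x.shape
    if tokens % num_windows != 0:
        raise ValueError("token count must be divisible by num_windows")
    return x.reshape(num_windows, tokens // num_windows, dim)


def match_windows(bev_win, gnd_win):
    """Return, for each BEV window, the index of the most similar ground window.

    Assumption: similarity is the cosine similarity between mean-pooled
    window descriptors. The paper's context-aware strategy may differ.
    """
    bev_desc = bev_win.mean(axis=1)
    gnd_desc = gnd_win.mean(axis=1)
    bev_desc = bev_desc / (np.linalg.norm(bev_desc, axis=1, keepdims=True) + 1e-8)
    gnd_desc = gnd_desc / (np.linalg.norm(gnd_desc, axis=1, keepdims=True) + 1e-8)
    similarity = bev_desc @ gnd_desc.T  # (num_bev_windows, num_gnd_windows)
    return similarity.argmax(axis=1)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def window_cross_attention(bev_win, gnd_win, match):
    """Single-head cross-attention from each BEV window to its matched ground window.

    BEV tokens act as queries; matched ground tokens act as keys and values.
    Learned projections are omitted to keep the sketch short.
    """
    matched = gnd_win[match]  # (num_bev_windows, gnd_tokens_per_window, dim)
    dim = bev_win.shape[-1]
    scores = bev_win @ matched.transpose(0, 2, 1) / np.sqrt(dim)
    attention = softmax(scores, axis=-1)
    return bev_win + attention @ matched  # residual update of the BEV tokens


rng = np.random.default_rng(0)
bev_embeddings = rng.normal(size=(16, 8))   # 16 BEV tokens, 8 channels
ground_features = rng.normal(size=(24, 8))  # 24 ground tokens, 8 channels

bev_windows = split_windows(bev_embeddings, num_windows=4)
ground_windows = split_windows(ground_features, num_windows=4)
match = match_windows(bev_windows, ground_windows)
bev_updated = window_cross_attention(bev_windows, ground_windows, match)
```

Restricting attention to one matched ground window per BEV window keeps the attention cost proportional to the window size rather than to the full ground feature map, and it avoids needing camera parameters to decide which ground region each BEV query should attend to.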
