To find the geolocation of a street-view image, cross-view geolocalization (CVGL) methods typically perform image retrieval on a database of georeferenced aerial images and determine the location from the visually most similar match. Recent approaches focus mainly on settings where street-view and aerial images are preselected to align w.r.t. translation or orientation, but they struggle in challenging real-world scenarios where varying camera poses have to be matched to the same aerial image. We propose a novel trainable retrieval architecture that uses bird's-eye-view (BEV) maps rather than vectors as the embedding representation and explicitly addresses the many-to-one ambiguity that arises in real-world scenarios. The BEV-based retrieval is trained using the same contrastive setting and loss as classical retrieval. Our method, C-BEV, surpasses the state of the art on the retrieval task on multiple datasets by a large margin. It is particularly effective in challenging many-to-one scenarios, e.g., increasing the top-1 recall on VIGOR's cross-area split with unknown orientation from 31.1% to 65.0%. Although the model is supervised only through a contrastive objective applied to image pairs, it additionally learns to infer the 3-DoF camera pose on the matching aerial image, and it even yields a lower mean pose error than recent methods that are explicitly trained with metric ground truth.
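For readers unfamiliar with the retrieval setting, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss and of the top-1 recall metric in a many-to-one setting. It is not the C-BEV implementation: the abstract does not specify how the similarity between two BEV maps is computed, so the sketch assumes a precomputed similarity matrix `sim`, and the function names, the `temperature` value, and the toy data are hypothetical.

```python
import numpy as np


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss over a square similarity matrix.

    Assumption: sim[i, j] is the similarity between street-view image i and
    aerial image j in a batch, with the matching pairs on the diagonal.
    """
    logits = np.asarray(sim, dtype=float) / temperature

    def cross_entropy_diag(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # The positive for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the street-to-aerial and aerial-to-street directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def top1_recall(sim, gt_index):
    """Fraction of queries whose best-scoring reference is the correct one.

    sim has shape (num_queries, num_references); gt_index[i] is the index of
    the aerial image covering query i. Several queries may share one index,
    which is the many-to-one case.
    """
    pred = np.asarray(sim).argmax(axis=1)
    return float(np.mean(pred == np.asarray(gt_index)))


# Toy example: three street-view queries, two aerial images.
# Queries 0 and 1 both lie on aerial image 0 (many-to-one).
toy_sim = np.array([[0.9, 0.1],
                    [0.8, 0.2],
                    [0.3, 0.7]])
toy_gt = np.array([0, 0, 1])
toy_recall = top1_recall(toy_sim, toy_gt)
```

In this sketch the loss only sees pairwise similarity scores, so any embedding representation (vectors or BEV maps) could be plugged in as long as it yields such a matrix.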