We propose a novel end-to-end method for cross-view pose estimation. Given a ground-level query image and an aerial image that covers the query's local neighborhood, the method estimates the query's 3-Degrees-of-Freedom (3-DoF) camera pose by matching its image descriptor to descriptors of local regions within the aerial image. The orientation-aware descriptors are obtained using a translationally equivariant convolutional ground image encoder and contrastive learning. The Localization Decoder produces a dense probability distribution in a coarse-to-fine manner with a novel Localization Matching Upsampling module. A smaller Orientation Decoder produces a vector field that conditions the orientation estimate on the localization. We validate our method on the VIGOR and KITTI datasets, where it reduces the median localization error by 72% and 36%, respectively, relative to the state-of-the-art baseline at comparable orientation estimation accuracy. The predicted probability distribution can represent localization ambiguity and enables rejecting possibly erroneous predictions. Without re-training, the model can run inference on ground images with different fields of view and can use an orientation prior if one is available. On the Oxford RobotCar dataset, our method reliably estimates the ego-vehicle's pose over time, achieving a median localization error under 1 meter and a median orientation error of around 1 degree at 14 FPS.
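The descriptor-matching idea can be illustrated with a minimal sketch. It is not the paper's architecture: the learned encoders, the coarse-to-fine Localization Decoder, and the Orientation Decoder are all omitted. It assumes the descriptors are already computed as plain non-zero vectors, scores each aerial cell by cosine similarity to the query, and turns the scores into a probability map with a softmax. The function names, the `temperature` parameter, and the `min_peak` rejection threshold are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def localization_distribution(query, aerial_grid, temperature=0.1):
    """Softmax over the similarity between the query descriptor and each cell.

    `aerial_grid` is a 2D list (rows x cols) of descriptors, one per local
    region of the aerial image. `temperature` is an assumed hyperparameter.
    """
    scores = [[cosine(query, cell) / temperature for cell in row]
              for row in aerial_grid]
    peak = max(max(row) for row in scores)  # subtract for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def argmax_cell(prob):
    """Return the (row, col) index of the most probable cell."""
    cells = ((r, c) for r in range(len(prob)) for c in range(len(prob[r])))
    return max(cells, key=lambda rc: prob[rc[0]][rc[1]])


def accept(prob, min_peak=0.5):
    """Hypothetical rejection rule: keep a prediction only if its peak is confident."""
    r, c = argmax_cell(prob)
    return prob[r][c] >= min_peak


# Toy 2x2 grid of 2-D descriptors; the query is closest to the cell at (1, 0).
query = [1.0, 0.0]
aerial_grid = [[[0.0, 1.0], [1.0, 1.0]],
               [[1.0, 0.1], [-1.0, 0.0]]]
prob = localization_distribution(query, aerial_grid)
print(argmax_cell(prob))  # (1, 0)
```

A flat or multi-peaked map from this kind of matching is what lets a probabilistic output express ambiguity, and a threshold such as `min_peak` is one simple way to reject uncertain predictions.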