Erroneous feature matches have a severe impact on subsequent camera pose estimation and often require additional, time-consuming outlier rejection measures such as RANSAC. Our method tackles this challenge by addressing feature matching and pose optimization jointly. To this end, we propose a graph attention network that predicts image correspondences along with confidence weights. The resulting matches serve as weighted constraints in differentiable pose estimation. Training feature matching with gradients from pose optimization naturally teaches the network to down-weight outliers and improves pose estimation on image pairs by 6.7% over SuperGlue on ScanNet. At the same time, it reduces pose estimation time by over 50% and renders RANSAC iterations unnecessary. Moreover, we integrate information from multiple views by spanning the graph across multiple frames and predicting all matches at once. Multi-view matching combined with end-to-end training improves the pose estimation metrics on Matterport3D by 18.5% compared to SuperGlue.
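To illustrate how confidence weights can stand in for explicit outlier rejection, the following minimal sketch solves a weighted rigid alignment in closed form. It is a stand-in, not the pose solver of this work: it assumes 3D-to-3D correspondences and a weighted Kabsch (Procrustes) solution, and the function names `weighted_rigid_alignment`, `rotation_z`, and `rotation_error_deg`, as well as the synthetic data, are hypothetical. The weights here are set by hand; in an end-to-end setting they would be predicted by the matching network, and because the solver consists only of means, a matrix product, and an SVD, the same steps written in an autodiff framework would let pose errors propagate back to those weights.

```python
import numpy as np


def weighted_rigid_alignment(src, dst, weights):
    """Return (R, t) minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2."""
    w = weights / weights.sum()              # normalize confidences
    src_mean = w @ src                       # weighted centroids
    dst_mean = w @ dst
    src_c = src - src_mean
    dst_c = dst - dst_mean
    H = (src_c * w[:, None]).T @ dst_c       # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


def rotation_z(angle):
    """Rotation matrix about the z axis (angle in radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


# Synthetic correspondences: 50 points, the first 10 are wrong matches.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
R_gt = rotation_z(np.radians(30.0))
t_gt = np.array([0.5, -0.2, 1.0])
dst = src @ R_gt.T + t_gt
dst[:10] += rng.normal(scale=2.0, size=(10, 3))

uniform = np.ones(50)                        # every match trusted equally
confident = np.ones(50)
confident[:10] = 0.0                         # outliers fully down-weighted

R_u, t_u = weighted_rigid_alignment(src, dst, uniform)
R_w, t_w = weighted_rigid_alignment(src, dst, confident)
err_uniform = rotation_error_deg(R_u, R_gt)
err_weighted = rotation_error_deg(R_w, R_gt)
print(f"rotation error, uniform weights:   {err_uniform:.4f} deg")
print(f"rotation error, confident weights: {err_weighted:.4f} deg")
```

With the outliers assigned zero weight, the solver recovers the ground-truth pose without any sampling loop, which is the intuition behind replacing RANSAC iterations with learned confidences.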