A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses from matched keypoints. Noisy correspondences make accurate estimation difficult. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, in which matched keypoints are nodes and nearby keypoints are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a rotation quaternion, a translation vector, and the essential matrix (EM). The relative pose between an image pair is obtained by minimizing a loss comprising (i) $\mathcal{L}_2$ differences from the ground truth (GT), (ii) the Frobenius norm of the difference between the estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences. The dense, detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
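As an illustration of the five loss terms listed above, the following NumPy sketch computes each term for a single predicted pose. It is not the paper's implementation: the function names, the (w, x, y, z) quaternion ordering, the convention $E = [t]_\times R$, the reading of "heading angle" as the angle between translation directions, and the reading of "scale" as the difference in translation norms are all assumptions made here. The abstract does not state how the terms are weighted, so the terms are returned separately rather than summed.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (ordering is an assumption)."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def essential(q, t):
    """Essential matrix under the assumed convention E = [t]_x R."""
    return skew(t) @ quat_to_rot(q)

def pose_loss_terms(q, t, q_gt, t_gt):
    """Return the five hypothetical loss terms as a dict (unweighted)."""
    q, t = np.asarray(q, dtype=float), np.asarray(t, dtype=float)
    q_gt, t_gt = np.asarray(q_gt, dtype=float), np.asarray(t_gt, dtype=float)
    E, E_gt = essential(q, t), essential(q_gt, t_gt)
    # Singular values are returned in descending order.
    s = np.linalg.svd(E, compute_uv=False)
    s_gt = np.linalg.svd(E_gt, compute_uv=False)
    cos = np.dot(t, t_gt) / (np.linalg.norm(t) * np.linalg.norm(t_gt))
    return {
        # (i) L2 difference on the raw pose parameters; note that q and -q
        # encode the same rotation, which this simple term does not handle.
        "l2": np.linalg.norm(q - q_gt) + np.linalg.norm(t - t_gt),
        # (ii) Frobenius norm of the difference between the two EMs.
        "frobenius": np.linalg.norm(E - E_gt, ord="fro"),
        # (iii) difference between singular value vectors.
        "singular": np.linalg.norm(s - s_gt),
        # (iv) angle between translation directions, in radians.
        "heading": np.arccos(np.clip(cos, -1.0, 1.0)),
        # (v) difference in translation magnitude.
        "scale": abs(np.linalg.norm(t) - np.linalg.norm(t_gt)),
    }

if __name__ == "__main__":
    identity = [1.0, 0.0, 0.0, 0.0]
    terms = pose_loss_terms(identity, [2.0, 0.0, 0.0], identity, [1.0, 0.0, 0.0])
    for name, value in terms.items():
        print(f"{name}: {value:.4f}")
```

A valid essential matrix has two equal nonzero singular values and one zero singular value, which is presumably why a singular-value term is useful as a geometric regularizer alongside the direct pose terms.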