The prior method ReLoc3R achieves breakthrough performance, with fast 25 ms inference and state-of-the-art regression accuracy, yet our analysis reveals subtle geometric inconsistencies in its internal representations that prevent it from reaching the precision ceiling of correspondence-based methods such as MASt3R (which require 300 ms per pair). In this work, we present GeLoc3r, a novel approach to relative camera pose estimation that enhances pose regression methods through Geometric Consistency Regularization (GCR). GeLoc3r overcomes the speed-accuracy dilemma by training regression networks to produce geometrically consistent poses without inference-time geometric computation. During training, GeLoc3r leverages ground-truth depth to generate dense 3D-2D correspondences, weights them using a FusionTransformer that learns correspondence importance, and computes geometrically consistent poses via weighted RANSAC. This yields a consistency loss that transfers geometric knowledge into the regression network. Unlike the FAR method, which requires both regression and geometric solving at inference, GeLoc3r uses only the enhanced regression head at test time, maintaining ReLoc3R's fast speed while approaching MASt3R's high accuracy. On challenging benchmarks, GeLoc3r consistently outperforms ReLoc3R, achieving significant improvements including 40.45% vs. 34.85% AUC@5° on the CO3Dv2 dataset (a 16% relative improvement), 68.66% vs. 66.70% AUC@5° on RealEstate10K, and 50.45% vs. 49.60% on MegaDepth1500. By teaching geometric consistency during training rather than enforcing it at inference, GeLoc3r represents a paradigm shift in how neural networks learn camera geometry, achieving both the speed of regression and the geometric understanding of correspondence methods.
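The training-time pipeline sketched above can be illustrated in miniature. The snippet below is a simplified sketch, not the paper's implementation: it lifts pixels to 3D with ground-truth depth (the correspondence-generation step) and measures a geodesic rotation distance between a regressed pose and a geometrically solved pose, standing in for the consistency loss. The helper names (`unproject`, `geodesic_rotation_loss`) and the choice of geodesic distance are illustrative assumptions; the FusionTransformer weighting and weighted RANSAC solver are omitted.

```python
import numpy as np

def unproject(depth, K):
    """Lift every pixel to a camera-space 3D point using ground-truth
    depth and intrinsics K (hypothetical helper; the paper generates
    dense 3D-2D correspondences this way before weighting them)."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    rays = pix.astype(np.float64) @ np.linalg.inv(K).T  # normalized rays
    return rays * depth.reshape(-1, 1)                  # scale by depth

def geodesic_rotation_loss(R_reg, R_geo):
    """Stand-in for the consistency loss: geodesic angle (radians)
    between the regressed rotation and the geometrically solved one."""
    cos_angle = (np.trace(R_reg.T @ R_geo) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Tiny usage example with identity intrinsics and unit depth.
K = np.eye(3)
depth = np.ones((4, 4))
pts3d = unproject(depth, K)          # (16, 3) camera-space points

R_identity = np.eye(3)
theta = np.pi / 2                    # 90-degree rotation about z
R_z90 = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
loss_same = geodesic_rotation_loss(R_identity, R_identity)  # 0 for agreement
loss_far = geodesic_rotation_loss(R_identity, R_z90)        # pi/2 here
```

At training time such a loss term would be added to the regression objective, so the network is penalized whenever its direct pose prediction drifts from the pose implied by the (weighted) geometric solve; at test time only the regression head runs, which is what preserves the 25 ms inference cost.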