SIM-Sync: From Certifiably Optimal Synchronization over the 3D Similarity Group to Scene Reconstruction with Learned Depth

This paper presents SIM-Sync, a certifiably optimal algorithm that estimates camera trajectory and 3D scene structure directly from multiview image keypoints. SIM-Sync fills the gap between pose graph optimization and bundle adjustment: the former admits efficient global optimization but requires relative pose measurements, whereas the latter directly consumes image keypoints but is difficult to optimize globally (due to camera projective geometry). A pretrained depth prediction network bridges this gap. Given a graph whose nodes represent monocular images taken at unknown camera poses and whose edges contain pairwise image keypoint correspondences, SIM-Sync first uses the depth prediction network to lift the 2D keypoints into 3D scaled point clouds, where the scale of each per-image point cloud is unknown because of the scale ambiguity in monocular depth prediction. SIM-Sync then seeks to jointly synchronize the unknown camera poses and scaling factors (i.e., over the 3D similarity group). The SIM-Sync formulation, despite being nonconvex, admits an efficient certifiably optimal solver that is almost identical to the SE-Sync algorithm. We demonstrate the tightness, robustness, and practical usefulness of SIM-Sync in both simulated and real experiments. In simulation, we show that (i) SIM-Sync compares favorably with SE-Sync in scale-free synchronization, and (ii) SIM-Sync can be used together with robust estimators to tolerate a large fraction of outliers. In real experiments, we show that (a) SIM-Sync achieves performance similar to Ceres on bundle adjustment datasets, and (b) SIM-Sync performs on par with ORB-SLAM3 on the TUM dataset with zero-shot depth prediction.
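To make the lifting step and the role of the per-image scale concrete, the following is a minimal two-view sketch, not the SIM-Sync solver itself. It back-projects keypoints with (arbitrarily scaled) depths and then recovers the relative scale, rotation, and translation between the two lifted point clouds using the closed-form Umeyama alignment, a swapped-in technique that handles a single edge only; SIM-Sync instead synchronizes all poses and scales jointly over the whole graph with a certifiable solver. The function names `lift_keypoints` and `umeyama_sim3`, the intrinsics `K`, the depth scale factors, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def lift_keypoints(uv, depth, K):
    """Back-project pixel keypoints (N, 2) with per-keypoint depths (N,)
    into camera-frame 3D points (N, 3), using pinhole intrinsics K."""
    homog = np.hstack([uv, np.ones((uv.shape[0], 1))])
    rays = (np.linalg.inv(K) @ homog.T).T      # rays normalized so that z = 1
    return rays * depth[:, None]

def umeyama_sim3(P, Q):
    """Closed-form least-squares similarity (s, R, t) minimizing
    sum_k || s * R @ P[k] + t - Q[k] ||^2 (Umeyama alignment)."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    Sigma = Qc.T @ Pc / P.shape[0]             # cross-covariance of the two clouds
    U, D, Vt = np.linalg.svd(Sigma)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # keep R a proper rotation (det = +1)
    R = U @ S @ Vt
    var_p = (Pc ** 2).sum() / P.shape[0]
    s = np.trace(np.diag(D) @ S) / var_p
    t = mu_q - s * R @ mu_p
    return s, R, t

def project(P, K):
    """Pinhole projection of camera-frame points (N, 3) to pixels (N, 2)."""
    uvw = (K @ P.T).T
    return uvw[:, :2] / uvw[:, 2:3]

# --- Synthetic two-view example (all values are illustrative assumptions) ---
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Ground-truth points in the frame of camera 1, all in front of the camera.
P1 = np.column_stack([rng.uniform(-1, 1, 50),
                      rng.uniform(-1, 1, 50),
                      rng.uniform(2, 6, 50)])

# Ground-truth transform from camera-1 frame to camera-2 frame.
a = 0.2
R12 = np.array([[np.cos(a), 0.0, np.sin(a)],
                [0.0, 1.0, 0.0],
                [-np.sin(a), 0.0, np.cos(a)]])
t12 = np.array([0.3, -0.1, 0.2])
P2 = (R12 @ P1.T).T + t12

# Matched keypoints, plus depths that are each off by an unknown per-image scale,
# mimicking the scale ambiguity of monocular depth prediction.
uv1, uv2 = project(P1, K), project(P2, K)
depth1 = 0.5 * P1[:, 2]
depth2 = 2.0 * P2[:, 2]

Q1 = lift_keypoints(uv1, depth1, K)            # scaled cloud in camera-1 frame
Q2 = lift_keypoints(uv2, depth2, K)            # scaled cloud in camera-2 frame

# Align cloud 2 onto cloud 1: s_est * R_est @ Q2[k] + t_est ~ Q1[k].
s_est, R_est, t_est = umeyama_sim3(Q2, Q1)
print("relative scale:", s_est)
```

In this noise-free setup the recovered relative scale is the ratio of the two depth scale factors (0.5 / 2.0), and the recovered rotation is the transpose of `R12`. With more than two images, pairwise estimates like this are generally not mutually consistent, which is why a joint synchronization over the 3D similarity group is needed.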
