Structure-from-motion (SfM) is a long-standing problem in computer vision that aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem incrementally by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent work has mostly used deep learning to improve individual components (e.g., keypoint matching), but it still relies on the original, non-differentiable pipeline. Instead, we propose a new deep pipeline, VGGSfM, in which each component is fully differentiable and can therefore be trained end-to-end. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of registering cameras one at a time. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets: CO3D, IMC Phototourism, and ETH3D.
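To make the bundle adjustment step concrete, the following is a minimal sketch of the classical reprojection error that bundle adjustment typically minimises over cameras and 3D points. It is not the VGGSfM layer itself: the pinhole camera model (principal point at the origin, no distortion), the function names `project` and `reprojection_error`, and the `tracks` layout are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]
Camera = Tuple[Mat3, Vec3, float]  # (rotation R, translation t, focal length); hypothetical layout


def project(point: Vec3, R: Mat3, t: Vec3, focal: float) -> Tuple[float, float]:
    """Project a world-space 3D point with a simple pinhole camera.

    Assumes the principal point is at the image origin and no lens distortion.
    """
    # Transform into the camera frame: X_cam = R @ X_world + t
    cam = [sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by scaling with the focal length
    return (focal * cam[0] / cam[2], focal * cam[1] / cam[2])


def reprojection_error(
    points: Sequence[Vec3],
    cameras: Sequence[Camera],
    tracks: Sequence[Sequence[Optional[Tuple[float, float]]]],
) -> float:
    """Sum of squared reprojection residuals over all visible observations.

    tracks[j][i] is the observed 2D location of point j in camera i,
    or None if the point is not visible in that image.
    """
    total = 0.0
    for j, point in enumerate(points):
        for i, (R, t, focal) in enumerate(cameras):
            obs = tracks[j][i]
            if obs is None:
                continue
            u, v = project(point, R, t, focal)
            total += (u - obs[0]) ** 2 + (v - obs[1]) ** 2
    return total


if __name__ == "__main__":
    identity: List[List[float]] = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Two cameras looking down +z, the second shifted one unit along x
    cameras = [(identity, [0.0, 0.0, 0.0], 100.0), (identity, [-1.0, 0.0, 0.0], 100.0)]
    points = [[0.5, 0.2, 4.0]]
    tracks = [[(12.5, 5.0), (-12.5, 5.0)]]
    print(reprojection_error(points, cameras, tracks))
```

A bundle adjustment solver would adjust `points` and `cameras` to drive this sum toward zero. A differentiable version would additionally expose gradients of the solution with respect to the input tracks, which this sketch does not attempt.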