In this paper we tackle the problem of learning Structure-from-Motion (SfM) using graph attention networks. SfM is a classic computer vision problem that is solved through iterative minimization of reprojection errors, referred to as Bundle Adjustment (BA), starting from a good initialization. To obtain a good enough initialization for BA, conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation, pose averaging, or triangulation) that provide an initial solution, which is then refined using BA. In this work we replace these sub-problems with a learned model that takes as input the 2D keypoints detected across multiple views and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods and challenges COLMAP while having lower runtime.
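To make the reprojection error that BA minimizes concrete, the following is a minimal sketch for a single pinhole camera. It is an illustration only, not the model or loss used in this work: the function names `project` and `reprojection_error`, the intrinsics `K`, and the toy points are all assumptions introduced here.

```python
import numpy as np

def project(R, t, K, X):
    """Project N x 3 world points X into a pinhole camera (hypothetical helper)."""
    X_cam = X @ R.T + t            # world frame -> camera frame
    uvw = X_cam @ K.T              # apply the (assumed) intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division -> pixel coordinates

def reprojection_error(R, t, K, X, x_obs):
    """Mean squared pixel distance between projected and observed keypoints."""
    residual = project(R, t, K, X) - x_obs
    return float(np.mean(np.sum(residual ** 2, axis=1)))

# Toy data (assumed values): four 3D points in front of one camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

x_obs = project(R, t, K, X)  # synthetic, noise-free observations

err_true = reprojection_error(R, t, K, X, x_obs)
err_perturbed = reprojection_error(R, t + np.array([0.1, 0.0, 0.0]), K, X, x_obs)
```

With noise-free observations the error at the true pose is zero, and it grows once the pose is perturbed. BA jointly adjusts all camera poses and 3D points to reduce this quantity summed over every view, which is why it depends on a good starting point.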