In this paper we tackle the problem of learning Structure-from-Motion (SfM) using graph attention networks. SfM is a classic computer vision problem that is solved through iterative minimization of reprojection errors, referred to as Bundle Adjustment (BA), starting from a good initialization. To obtain a good enough initialization for BA, conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation, pose averaging, or triangulation) that provide an initial solution, which is then refined using BA. In this work we replace these sub-problems with a learned model that takes as input the 2D keypoints detected across multiple views and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods and challenges COLMAP while having a lower runtime.
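For readers unfamiliar with the quantity that BA minimizes, the following is a minimal sketch of a sum-of-squared reprojection errors cost. It is not the proposed model and not COLMAP's implementation. It assumes a simple pinhole camera with a single focal length and the principal point at the image origin, and the names `project`, `reprojection_cost`, `cameras`, `points`, and `observations` are hypothetical, introduced only for illustration.

```python
def project(R, t, f, X):
    """Project a 3D point X with an assumed pinhole camera.

    R is a 3x3 rotation (list of rows), t a translation, f a focal length;
    the principal point is assumed to be at the image origin.
    """
    # Camera-frame coordinates: Xc = R * X + t
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by scaling with the focal length
    return (f * Xc[0] / Xc[2], f * Xc[1] / Xc[2])


def reprojection_cost(cameras, points, observations):
    """Sum of squared reprojection errors over all observations.

    cameras: list of (R, t, f) tuples
    points: list of 3D points
    observations: list of (camera_index, point_index, (u, v)) keypoints
    """
    total = 0.0
    for cam_idx, pt_idx, (u, v) in observations:
        R, t, f = cameras[cam_idx]
        pu, pv = project(R, t, f, points[pt_idx])
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


# Toy scene: two cameras with identity rotation, one 3D point in front of both.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [(I3, [0.0, 0.0, 0.0], 1.0), (I3, [-1.0, 0.0, 0.0], 1.0)]
points = [[0.0, 0.0, 2.0]]
# Noise-free keypoints for the point in each view.
observations = [(0, 0, (0.0, 0.0)), (1, 0, (-0.5, 0.0))]
```

BA would iteratively adjust the entries of `cameras` and `points` to reduce this cost. The initialization discussed above determines the values from which that iteration starts.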