Uncertainty-Driven Dense Two-View Structure from Motion

This work introduces an effective and practical solution to the dense two-view structure from motion (SfM) problem. A key question it addresses is how to make careful use of per-pixel optical flow correspondences between two frames for accurate pose estimation, given that perfect per-pixel correspondence between two images is difficult, if not impossible, to establish. From the carefully estimated camera pose and the predicted per-pixel optical flow correspondences, a dense depth map of the scene is computed. An iterative refinement procedure then further improves optical flow matching confidence, camera pose, and depth by exploiting their inherent dependency in rigid SfM. The fundamental idea is to exploit per-pixel uncertainty in the optical flow estimate and to make the dense SfM system robust via online refinement. Concretely, we introduce our uncertainty-driven Dense Two-View SfM pipeline (DTV-SfM), which consists of three components: (i) an uncertainty-aware dense optical flow estimation approach that provides per-pixel correspondences together with matching confidence scores; (ii) a weighted dense bundle adjustment formulation that uses optical flow uncertainty and bidirectional optical flow consistency to refine both pose and depth; and (iii) a depth estimation network that accounts for consistency with the estimated poses and with optical flow that respects the epipolar constraint. Extensive experiments show that the proposed approach achieves remarkable depth accuracy and state-of-the-art camera pose results, surpassing SuperPoint and SuperGlue in accuracy on benchmark datasets such as DeMoN, YFCC100M, and ScanNet. Code and more materials are available at http://vis.xyz/pub/dtv-sfm.
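
To make the idea of confidence-weighted pose estimation concrete, the following is a minimal sketch and not the DTV-SfM implementation. It swaps in the classical eight-point algorithm, with each correspondence weighted by a confidence score, in place of the weighted dense bundle adjustment described above. The function name `weighted_eight_point` and its arguments are hypothetical. The sketch assumes that correspondences are given in normalized camera coordinates (pixel coordinates with the inverse intrinsics applied), that the second-view points come from adding the optical flow to the first-view pixels, and that the weights are non-negative.

```python
import numpy as np


def weighted_eight_point(x1, x2, w):
    """Estimate an essential matrix E such that x2^T E x1 is close to 0.

    x1, x2 : (N, 2) arrays of normalized image coordinates in views 1 and 2.
    w      : (N,) array of non-negative confidence weights (assumed to come
             from a flow uncertainty estimate; hypothetical input).
    """
    ones = np.ones((len(x1), 1))
    x1h = np.hstack([x1, ones])  # homogeneous coordinates, view 1
    x2h = np.hstack([x2, ones])  # homogeneous coordinates, view 2

    # Each row is the flattened outer product x2 x1^T, so that
    # row . vec(E) = x2^T E x1 with E stored row-major.
    A = np.einsum("ni,nj->nij", x2h, x1h).reshape(-1, 9)

    # Scaling rows by sqrt(w) minimizes sum_n w_n * (x2_n^T E x1_n)^2.
    A = A * np.sqrt(w)[:, None]

    # The minimizer under ||vec(E)|| = 1 is the last right singular vector.
    _, _, vt = np.linalg.svd(A)
    E = vt[-1].reshape(3, 3)

    # Project onto the set of essential matrices:
    # two equal singular values and one zero singular value.
    u, s, vt = np.linalg.svd(E)
    sigma = 0.5 * (s[0] + s[1])
    return u @ np.diag([sigma, sigma, 0.0]) @ vt
```

A correspondence with weight zero contributes nothing to the estimate, so pixels flagged as unreliable (for example, by a failed forward-backward flow check) can be suppressed without being removed from the input arrays.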
