We present a real-time visual-inertial dense mapping method that performs high-quality incremental 3D mesh reconstruction using only sequential monocular images and inertial measurement unit (IMU) readings. 6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO) system, which also generates noisy sparse 3D map points as a by-product. We propose a sparse point aided multi-view stereo neural network (SPA-MVSNet) that effectively leverages these informative but noisy sparse points. The sparse depth from VIO is first completed by a single-view depth completion network. The resulting dense depth map, although inherently limited in accuracy, is then used as a prior to guide cost volume generation and regularization in our MVS network for accurate dense depth prediction. The depth maps predicted by the MVS network for keyframe images are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the entire visual-inertial dense mapping system on several public datasets as well as our own dataset, demonstrating that the system generalizes well and delivers high-quality 3D mesh reconstruction online. Our proposed dense mapping system achieves a 39.7% improvement in F-score over existing systems when evaluated on the challenging scenarios of the EuRoC dataset. We plan to release the code of this work upon acceptance.
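To illustrate the incremental fusion step, the following is a minimal sketch of the weighted running average that TSDF-style fusion commonly applies to each voxel, reduced to the voxels along a single camera ray. It is not the implementation used in this work: the class `RayTSDF`, its methods, and the truncation distance of 0.2 m are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RayTSDF:
    """Truncated signed distance values for voxels along one camera ray.

    Hypothetical helper for illustration; a real system fuses a 3D voxel grid.
    """
    voxel_depths: List[float]   # depth of each voxel center along the ray (m)
    trunc: float = 0.2          # truncation distance (m), assumed value
    tsdf: List[float] = field(init=False)
    weight: List[float] = field(init=False)

    def __post_init__(self) -> None:
        # Unobserved voxels start as "free space" with zero weight.
        self.tsdf = [1.0] * len(self.voxel_depths)
        self.weight = [0.0] * len(self.voxel_depths)

    def integrate(self, observed_depth: float, obs_weight: float = 1.0) -> None:
        """Fuse one depth observation using a weighted running average."""
        for i, z in enumerate(self.voxel_depths):
            sdf = observed_depth - z          # positive in front of the surface
            if sdf < -self.trunc:
                continue                      # far behind the surface: leave unchanged
            d = min(1.0, sdf / self.trunc)    # truncate and normalize to [-1, 1]
            w = self.weight[i]
            self.tsdf[i] = (w * self.tsdf[i] + obs_weight * d) / (w + obs_weight)
            self.weight[i] = w + obs_weight

    def surface_depth(self) -> Optional[float]:
        """Return the first zero crossing along the ray, linearly interpolated."""
        for i in range(len(self.voxel_depths) - 1):
            a, b = self.tsdf[i], self.tsdf[i + 1]
            observed = self.weight[i] > 0 and self.weight[i + 1] > 0
            if observed and a > 0 >= b:
                z0, z1 = self.voxel_depths[i], self.voxel_depths[i + 1]
                return z0 + (z1 - z0) * a / (a - b)
        return None


# Usage: two keyframes observe the same surface at slightly different depths.
ray = RayTSDF(voxel_depths=[i * 0.1 for i in range(21)])
ray.integrate(1.0)
ray.integrate(1.1)
fused = ray.surface_depth()
```

With equal weights, the fused surface estimate lies between the two observations, which is how averaging in the TSDF suppresses noise in individual keyframe depth maps. A mesh would then be extracted from the zero level set of the full 3D volume.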