We present a real-time visual-inertial dense mapping method that performs high-quality incremental 3D mesh reconstruction using only sequential monocular images and inertial measurement unit (IMU) readings. 6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO) system, which also generates noisy sparse 3D map points as a by-product. We propose a sparse-point-aided multi-view stereo neural network (SPA-MVSNet) that effectively leverages these informative but noisy sparse points. The sparse depth from VIO is first completed by a single-view depth completion network. The resulting dense depth map, although inherently limited in accuracy, then serves as a prior that guides cost volume generation and regularization in our MVS network, yielding accurate dense depth predictions. The depth maps that the MVS network predicts for keyframe images are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the complete visual-inertial dense mapping system on several public datasets as well as our own dataset, demonstrating strong generalization and the ability to deliver high-quality 3D mesh reconstruction online. On the challenging scenarios of the EuRoC dataset, our dense mapping system achieves a 39.7% improvement in F-score over existing systems.
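The incremental fusion step can be illustrated with a minimal sketch of the weighted running average that underlies TSDF fusion in general. This is not the system's implementation: it assumes a one-dimensional row of voxels along a single camera ray, unit observation weights, and a hypothetical truncation distance `trunc`; the function names `integrate_depth` and `zero_crossing` are introduced here for illustration only.

```python
def integrate_depth(tsdf, weights, voxel_depths, observed_depth, trunc=0.1):
    """Fuse one depth observation into a 1D TSDF along a single camera ray.

    Assumption: every observation carries unit weight.
    """
    for i, z in enumerate(voxel_depths):
        sdf = observed_depth - z  # positive in front of the observed surface
        if sdf < -trunc:
            continue  # voxel lies well behind the surface (occluded), skip it
        d = min(1.0, sdf / trunc)  # truncate and normalise to [-1, 1]
        w = weights[i]
        tsdf[i] = (tsdf[i] * w + d) / (w + 1.0)  # weighted running average
        weights[i] = w + 1.0
    return tsdf, weights


def zero_crossing(tsdf, weights, voxel_depths):
    """Return the interpolated depth where the TSDF changes sign, or None."""
    for i in range(len(tsdf) - 1):
        observed = weights[i] > 0 and weights[i + 1] > 0
        if observed and tsdf[i] > 0 >= tsdf[i + 1]:
            t = tsdf[i] / (tsdf[i] - tsdf[i + 1])
            return voxel_depths[i] + t * (voxel_depths[i + 1] - voxel_depths[i])
    return None


# Voxels every 5 cm from 0 m to 2 m along the ray.
voxel_depths = [0.05 * k for k in range(41)]
tsdf = [0.0] * len(voxel_depths)
weights = [0.0] * len(voxel_depths)

# Three noisy depth predictions of a surface near 1 m (hypothetical values).
for depth in (0.98, 1.02, 1.00):
    integrate_depth(tsdf, weights, voxel_depths, depth)

surface = zero_crossing(tsdf, weights, voxel_depths)
print(surface)  # approximately 1.0 for these inputs
```

Averaging truncated signed distances per voxel lets the noise in individual depth maps cancel out, which is why per-keyframe predictions of limited accuracy can still produce a consistent surface once fused. A full system would project 3D voxels into each keyframe using the estimated camera pose rather than working along one ray.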