Incrementally recovering dense 3D structure from monocular video is essential to many robotics and augmented reality (AR) applications. Feature volumes have recently been shown to enable efficient and accurate incremental dense reconstruction without first estimating depth, but they cannot reach as high a resolution as depth-based methods because high-resolution feature volumes consume a large amount of memory. This letter proposes a real-time, feature volume-based dense reconstruction method that predicts TSDF (Truncated Signed Distance Function) values from a novel sparsified deep feature volume. The method achieves higher resolutions than previous feature volume-based methods and is well suited to large-scale outdoor scenarios, where the majority of voxels are empty. An uncertainty-aware multi-view stereo (MVS) network is used to infer the initial voxel locations of the physical surface in a sparse feature volume. To refine the recovered 3D geometry, deep features are then aggregated with attention from multi-view images at potential surface locations and fused temporally. Besides achieving higher resolutions than previous methods, our method is shown to produce more complete reconstructions with finer detail in many cases. Extensive evaluations on both public and self-collected datasets demonstrate that our method achieves highly competitive real-time reconstruction results compared with state-of-the-art reconstruction methods in both indoor and outdoor settings.
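To make the memory argument concrete, the following minimal sketch shows how a TSDF volume can be stored sparsely by keeping only voxels near the surface. It is an illustration under stated assumptions, not the method proposed in this letter: the function names `truncated_sdf` and `sparsify`, the toy planar scene, the 50-voxel grid, and the 3-voxel truncation distance are all hypothetical, and the signed distances are computed analytically rather than predicted by a network.

```python
import numpy as np

def truncated_sdf(signed_dist, trunc):
    """Scale signed distances by the truncation distance and clamp to [-1, 1]."""
    return np.clip(signed_dist / trunc, -1.0, 1.0)

def sparsify(tsdf):
    """Keep only voxels strictly inside the truncation band (|TSDF| < 1)."""
    mask = np.abs(tsdf) < 1.0
    coords = np.argwhere(mask)   # (N, 3) integer voxel indices
    values = tsdf[mask]          # (N,) TSDF values at those voxels
    return coords, values

# Toy scene (hypothetical): a plane at height 12.25 voxels inside a 50^3 grid.
# Distances are expressed in voxel units and vary only along the last axis.
n = 50
signed_dist = np.broadcast_to(np.arange(n) - 12.25, (n, n, n))

tsdf = truncated_sdf(signed_dist, trunc=3.0)  # assumed 3-voxel truncation distance
coords, values = sparsify(tsdf)

# Fraction of voxels that must be stored in the sparse representation.
occupancy = len(values) / tsdf.size
print(f"stored {len(values)} of {tsdf.size} voxels ({occupancy:.0%})")
```

In this toy case only the few slices around the plane fall inside the truncation band, so the sparse representation stores a small fraction of the dense grid. The same effect is what makes sparse volumes attractive in scenes where most voxels are empty.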