We propose VisFusion, a visibility-aware online 3D scene reconstruction approach for posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Previous reconstruction methods aggregate features for each voxel from the input views without considering whether the voxel is visible in them. We instead improve the feature fusion by explicitly inferring each voxel's visibility from a similarity matrix computed from the voxel's projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline that includes a volume sparsification process. Unlike those works, which sparsify voxels globally with a fixed occupancy threshold, we perform the sparsification on a local feature volume along each visual ray, preserving at least one voxel per ray to retain finer details. The sparse local volume is then fused with a global one for online reconstruction. We further propose to predict the TSDF in a coarse-to-fine manner by learning its residuals across scales, which leads to better TSDF predictions. Experimental results on benchmarks show that our method achieves superior performance and recovers more scene details. Code is available at: https://github.com/huiyu-gao/VisFusion
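To make the difference between global and ray-wise sparsification concrete, the following is a minimal sketch and not the released VisFusion implementation. It assumes the local volume has already been resampled so that each row holds the occupancy scores of the voxels along one visual ray; the function names, the `(n_rays, n_samples)` layout, and the threshold value are all hypothetical choices for illustration. The actual selection rule in the paper may differ in detail.

```python
import numpy as np

def sparsify_global(occ, threshold=0.5):
    """Baseline: keep only voxels whose occupancy score exceeds a fixed threshold."""
    return occ > threshold

def sparsify_per_ray(occ, threshold=0.5):
    """Keep voxels above the threshold, plus the best-scoring voxel on every ray.

    occ: array of shape (n_rays, n_samples), where occ[r, s] is the predicted
    occupancy score of the s-th voxel along ray r (assumed layout).
    """
    keep = occ > threshold
    # Index of the highest-scoring voxel along each ray.
    best = np.argmax(occ, axis=1)
    # Force that voxel to survive so no ray ends up empty.
    keep[np.arange(occ.shape[0]), best] = True
    return keep

# Toy scores: the second ray has no voxel above the threshold.
occ = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.3, 0.1],
])
global_mask = sparsify_global(occ)
ray_mask = sparsify_per_ray(occ)
```

In this toy input the fixed threshold discards every voxel on the second ray, while the ray-wise rule retains that ray's highest-scoring voxel. This is the property the abstract describes as preserving at least one voxel per ray.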