Real-time monocular 3D reconstruction remains a challenging, unsolved problem. Although recent end-to-end methods have shown promising results, they rarely capture tiny structures and geometric boundaries, because their supervision is insufficient and neglects spatial details, and their feature fusion is oversimplified and ignores temporal cues. To address these problems, we propose SST, an end-to-end 3D reconstruction network that uses Sparse estimated points from a visual SLAM system as additional Spatial guidance and fuses Temporal features via a novel cross-modal attention mechanism, achieving more detailed reconstruction results. We propose a Local Spatial-Temporal Fusion module to exploit more informative spatial-temporal cues from multi-view color information and sparse priors, as well as a Global Spatial-Temporal Fusion module to refine the local TSDF volumes with the world-frame model from coarse to fine. Extensive experiments on ScanNet and 7-Scenes demonstrate that SST outperforms all state-of-the-art competitors while maintaining a high inference speed of 59 FPS, enabling real-world applications with real-time requirements.
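To make the idea of cross-modal attention between image features and sparse SLAM points more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in NumPy. It is not the SST architecture: the function name `cross_modal_attention`, the feature shapes, the absence of learned projections, and the residual connection are all assumptions chosen for illustration.

```python
import numpy as np


def cross_modal_attention(image_feats, sparse_feats):
    """Illustrative cross-attention (hypothetical, not the SST module).

    image_feats:  (N, d) array, one query per voxel (assumed layout).
    sparse_feats: (M, d) array, one key/value per sparse SLAM point.
    Returns (fused, weights) with shapes (N, d) and (N, M).
    """
    d = image_feats.shape[1]
    # Scaled dot-product scores between every voxel and every sparse point.
    scores = image_feats @ sparse_feats.T / np.sqrt(d)
    # Numerically stable softmax over the sparse-point axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Residual connection (an assumption) keeps the original image evidence.
    fused = image_feats + weights @ sparse_feats
    return fused, weights


# Toy inputs: 4 voxels, 3 sparse points, 8-dimensional features.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 8))
sparse_feats = rng.normal(size=(3, 8))
fused, weights = cross_modal_attention(image_feats, sparse_feats)
```

Each row of `weights` sums to one, so every voxel receives a convex combination of the sparse-point features. A practical network would typically add learned query, key, and value projections and multiple heads.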