Monocular scene reconstruction from posed images is challenging because large environments are complex. Recent volumetric methods learn to predict the TSDF volume directly and have shown promising results on this task. However, most methods focus on how to extract and fuse 2D features into a 3D feature volume, and none of them improves how the 3D volume is aggregated. In this work, we propose an SDF transformer network, which replaces the 3D CNN for better 3D feature aggregation. To reduce the explosive computational complexity of 3D multi-head attention, we propose a sparse window attention module, in which attention is calculated only between the non-empty voxels within a local window. We then build a top-down-bottom-up 3D attention network for 3D feature aggregation, in which a dilate-attention structure is proposed to prevent geometry degeneration and two global modules provide global receptive fields. Experiments on multiple datasets show that this 3D transformer network generates more accurate and complete reconstructions, outperforming previous methods by a large margin. Remarkably, mesh accuracy is improved by 41.8% and mesh completeness by 25.3% on the ScanNet dataset. Project page: https://weihaosky.github.io/sdfformer.
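To make the sparse window attention idea concrete, the following is a minimal sketch, not the network's actual module. It assumes non-overlapping cubic windows, a single head, and raw features used directly as queries, keys, and values; the function name `sparse_window_attention` and the `window` parameter are hypothetical. It only illustrates that attention is restricted to non-empty voxels sharing a local window.

```python
import math
from collections import defaultdict


def sparse_window_attention(coords, feats, window=4):
    """Single-head self-attention among non-empty voxels in the same window.

    coords: list of (x, y, z) integer indices, one per non-empty voxel.
    feats:  list of feature vectors (lists of floats), aligned with coords.
    Assumption: no learned projections, no multiple heads, no positional
    encoding; the raw features act as queries, keys, and values.
    """
    # Group the indices of non-empty voxels by the window that contains them.
    # Empty voxels never appear in coords, so they cost nothing.
    windows = defaultdict(list)
    for i, (x, y, z) in enumerate(coords):
        windows[(x // window, y // window, z // window)].append(i)

    dim = len(feats[0])
    scale = 1.0 / math.sqrt(dim)
    out = [None] * len(feats)
    for members in windows.values():
        for i in members:
            # Scaled dot-product scores against voxels in the same window only.
            scores = [
                scale * sum(a * b for a, b in zip(feats[i], feats[j]))
                for j in members
            ]
            # Numerically stable softmax over the window members.
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            # Weighted sum of the members' features.
            out[i] = [
                sum(e / total * feats[j][d] for e, j in zip(exps, members))
                for d in range(dim)
            ]
    return out


# Two voxels share the window at (0, 0, 0); the third sits alone in another.
coords = [(0, 0, 0), (1, 2, 3), (9, 9, 9)]
feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out = sparse_window_attention(coords, feats, window=4)
```

In this sketch the cost is the sum of squared window occupancies rather than the square of the total voxel count, which is the saving that local windows over non-empty voxels are meant to provide.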