Monocular scene reconstruction from posed images is challenging because of the complexity of large environments. Recent volumetric methods learn to predict the TSDF volume directly and have shown promising results on this task. However, most of them focus on how to extract 2D features and fuse them into a 3D feature volume, and none improves how the 3D volume itself is aggregated. In this work, we propose an SDF transformer network that replaces the 3D CNN for better 3D feature aggregation. To curb the explosive computational complexity of 3D multi-head attention, we propose a sparse window attention module, in which attention is computed only between the non-empty voxels within a local window. We then build a top-down-bottom-up 3D attention network for 3D feature aggregation, in which a dilate-attention structure prevents geometry degeneration and two global modules provide global receptive fields. Experiments on multiple datasets show that this 3D transformer network produces more accurate and complete reconstructions and outperforms previous methods by a large margin. Remarkably, on the ScanNet dataset, mesh accuracy improves by 41.8% and mesh completeness by 25.3%. Project page: https://weihaosky.github.io/sdfformer.
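The sparse window attention idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sparse_window_attention`, the dictionary-based voxel storage, the single attention head, the use of raw features as queries, keys, and values (no learned projections), and the non-overlapping window partition are all assumptions made for illustration. The only property taken from the abstract is that attention is computed solely between non-empty voxels that share a local window.

```python
import math
from collections import defaultdict


def sparse_window_attention(voxels, window=2):
    """Single-head attention restricted to non-empty voxels in the same window.

    voxels: dict mapping integer (x, y, z) coordinates to a feature vector
            (list of floats). Empty voxels are simply absent from the dict.
    window: edge length of the (assumed non-overlapping) cubic windows.
    """
    # Group the non-empty voxels by the window they fall into.
    windows = defaultdict(list)
    for coord in voxels:
        key = tuple(c // window for c in coord)
        windows[key].append(coord)

    out = {}
    for coords in windows.values():
        for q in coords:
            fq = voxels[q]
            scale = 1.0 / math.sqrt(len(fq))
            # Scaled dot-product scores against voxels of the same window only.
            scores = [
                scale * sum(a * b for a, b in zip(fq, voxels[k]))
                for k in coords
            ]
            # Numerically stable softmax weights (normalised below).
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            # Weighted average of the value vectors (here: the raw features).
            out[q] = [
                sum(w * voxels[k][i] for w, k in zip(weights, coords)) / total
                for i in range(len(fq))
            ]
    return out
```

In this sketch the cost per window grows with the square of the number of non-empty voxels in that window, rather than with the square of the total voxel count, which is the intuition behind restricting attention to sparse local windows.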