Indoor scene reconstruction from monocular images has long been sought after by augmented reality and robotics developers. Recent advances in neural field representations and monocular priors have led to remarkable results in scene-level surface reconstruction. The reliance on multilayer perceptrons (MLPs), however, significantly limits training and rendering speed. In this work, we propose to directly store a signed distance function (SDF) in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. Our globally sparse and locally dense data structure exploits the spatial sparsity of surfaces, enables cache-friendly queries, and extends directly to multi-modal data such as color and semantic labels. To apply this representation to monocular scene reconstruction, we develop a scale calibration algorithm for fast geometric initialization from monocular depth priors. Starting from this initialization, we apply differentiable volume rendering to refine details with fast convergence. We also introduce efficient high-dimensional Conditional Random Fields (CRFs) to further exploit the semantic-geometry consistency between scene objects. Experiments show that our approach is 10x faster in training and 100x faster in rendering while achieving accuracy comparable to state-of-the-art neural implicit methods.
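To make the "globally sparse and locally dense" idea concrete, the following is a minimal sketch of one way such a grid could be organized. It is not the authors' implementation: the class name `SparseVoxelBlockGrid`, the helper `demo_plane_grid`, the block resolution of 8, the voxel size, and the use of a Python dictionary as the hash map are all assumptions made for illustration. Blocks are allocated only where voxels are written (global sparsity), each block holds a small dense SDF array (local density), and a query trilinearly interpolates the eight surrounding voxels.

```python
import numpy as np


class SparseVoxelBlockGrid:
    """Hypothetical sketch: a hash map of blocks, each a dense B^3 SDF array."""

    def __init__(self, voxel_size=0.1, block_res=8, truncation=1.0):
        self.voxel_size = voxel_size
        self.block_res = block_res
        self.truncation = truncation  # default SDF for voxels never written
        self.blocks = {}  # (bx, by, bz) -> float32 array of shape (B, B, B)

    def _split(self, voxel_index):
        """Split a global voxel index into (block key, index inside block)."""
        v = np.asarray(voxel_index, dtype=np.int64)
        # Floor division and modulo keep negative coordinates consistent.
        block = tuple(int(b) for b in v // self.block_res)
        local = tuple(int(i) for i in v % self.block_res)
        return block, local

    def set_voxel(self, voxel_index, sdf):
        """Write an SDF value, allocating the enclosing block on demand."""
        block, local = self._split(voxel_index)
        if block not in self.blocks:
            shape = (self.block_res,) * 3
            self.blocks[block] = np.full(shape, self.truncation, np.float32)
        self.blocks[block][local] = sdf

    def get_voxel(self, voxel_index):
        """Read an SDF value, or None if the enclosing block is unallocated."""
        block, local = self._split(voxel_index)
        dense = self.blocks.get(block)
        return None if dense is None else float(dense[local])

    def query(self, point):
        """Trilinearly interpolate the SDF at a world-space point.

        Returns None if any of the 8 neighboring voxels lies in an
        unallocated block, i.e. the point is far from any stored surface.
        """
        p = np.asarray(point, dtype=np.float64) / self.voxel_size
        base = np.floor(p).astype(np.int64)
        frac = p - base
        value = 0.0
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    corner = self.get_voxel(base + (dx, dy, dz))
                    if corner is None:
                        return None
                    wx = frac[0] if dx else 1.0 - frac[0]
                    wy = frac[1] if dy else 1.0 - frac[1]
                    wz = frac[2] if dz else 1.0 - frac[2]
                    value += wx * wy * wz * corner
        return value


def demo_plane_grid():
    """Store a narrow SDF band around the plane z = 0.5 (illustrative data)."""
    grid = SparseVoxelBlockGrid(voxel_size=0.1, block_res=8)
    for x in range(16):
        for y in range(16):
            for z in range(3, 8):  # only voxels near the surface are stored
                grid.set_voxel((x, y, z), z * grid.voxel_size - 0.5)
    return grid
```

In this toy setup, calling `demo_plane_grid().query((0.55, 0.55, 0.52))` interpolates inside the stored band, while a point far from the plane, such as `(0.55, 0.55, 2.0)`, returns `None` because no block was ever allocated there. A practical system would replace the dictionary with a GPU hash map and batch the queries, but the split between a sparse block index and dense per-block arrays is the same.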