Combining the signed distance function (SDF) with differentiable volume rendering has emerged as a powerful paradigm for surface reconstruction from multi-view images without 3D supervision. However, current methods require lengthy per-scene optimization and cannot generalize to new scenes. In this paper, we present GenS, an end-to-end generalizable neural surface reconstruction model. Unlike coordinate-based methods that train a separate network for each scene, we construct a generalized multi-scale volume to directly encode all scenes. Compared with existing solutions, our representation is more expressive: it recovers high-frequency details while maintaining global smoothness. We also introduce a multi-scale feature-metric consistency that imposes multi-view consistency in a more discriminative multi-scale feature space, which is robust to failures of photometric consistency. Because the features are learnable, they can be self-enhanced to continuously improve matching accuracy and mitigate aggregation ambiguity. Furthermore, we design a view contrast loss that makes the model robust to regions covered by few viewpoints by distilling the geometric prior from dense inputs to sparse inputs. Extensive experiments on popular benchmarks show that our model generalizes well to new scenes and outperforms existing state-of-the-art methods, even those employing ground-truth depth supervision. Code is available at https://github.com/prstrive/GenS.