Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face significant limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas such as facial hair. Recent alternatives present a trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a single feed-forward pass. We then use these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost.
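To illustrate how predicted normals could act as a geometric prior during inverse rendering, the following is a minimal sketch of one possible penalty term: the mean of one minus the cosine similarity between the normals rendered from the current surface estimate and the normals predicted by the feed-forward model. The abstract does not specify the loss that is actually used, so the function name `normal_prior_loss`, the array shapes, the foreground mask, and the choice of a cosine penalty are all assumptions made for illustration.

```python
import numpy as np


def normal_prior_loss(rendered, predicted, mask, eps=1e-8):
    """Mean (1 - cosine similarity) between two normal maps over masked pixels.

    Hypothetical helper, not the paper's stated loss.
    rendered, predicted: float arrays of shape (H, W, 3)
    mask: boolean array of shape (H, W), True where the head is visible
    """
    # Normalize both maps to unit length so the dot product is a cosine.
    r = rendered / (np.linalg.norm(rendered, axis=-1, keepdims=True) + eps)
    p = predicted / (np.linalg.norm(predicted, axis=-1, keepdims=True) + eps)

    # Per-pixel cosine similarity between rendered and predicted normals.
    cos = np.sum(r * p, axis=-1)

    valid = mask.astype(bool)
    if not valid.any():
        # No foreground pixels: the term contributes nothing.
        return 0.0

    # 0 when the normals agree, 1 when orthogonal, 2 when opposite.
    return float(np.mean(1.0 - cos[valid]))
```

In an optimization loop, a term of this kind would typically be evaluated per view and added to a photometric objective with some weight; that weighting is likewise an assumption rather than a detail given above.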