High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild, smartphone-based setup; however, their model-based inverse rendering paradigm struggles to disentangle reflectance from unknown illumination. To bridge this gap, we propose shifting the paradigm to training a powerful delighting network that serves as a prior to constrain the optimization. We train on the OLAT (one-light-at-a-time) dataset and rendered Light Stage scans, and propose Dataset Latent Modulation (DLM) to integrate these heterogeneous data sources. Specifically, by conditioning the core network on learnable source-aware tokens, we decouple dataset-specific styles from physical delighting principles, which yields a delighting prior that outperforms existing proprietary models. This prior enables a simple, automatic appearance capture pipeline that achieves high-quality reflectance estimation from casual video inputs and outperforms prior methods by a large margin. Furthermore, we use our appearance capture method to transform the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans. By open-sourcing our model and the NeRSemble-Scan dataset, we democratize high-end facial capture and provide a new foundation for the research community to build photorealistic digital humans.
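As a rough illustration of what conditioning a shared network on learnable source-aware tokens could look like, the sketch below gives each data source its own per-channel scale and shift, applied to features from a shared core. This is only an assumption for illustration: the abstract does not say how the tokens enter the network, and the scale-and-shift form, the class names `SourceToken` and `DatasetModulation`, and the source labels `"olat"` and `"light_stage"` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SourceToken:
    """Hypothetical learnable token for one data source: per-channel scale and shift."""
    scale: list
    shift: list


@dataclass
class DatasetModulation:
    """Illustrative stand-in for source-aware conditioning (not the paper's DLM)."""
    dim: int
    tokens: dict = field(default_factory=dict)

    def add_source(self, name):
        # Identity initialisation: a new source leaves shared features unchanged.
        self.tokens[name] = SourceToken([1.0] * self.dim, [0.0] * self.dim)

    def modulate(self, features, source):
        # Shared features are adjusted per channel by the token of their source.
        if len(features) != self.dim:
            raise ValueError("feature length must equal dim")
        tok = self.tokens[source]
        return [s * f + b for f, s, b in zip(features, tok.scale, tok.shift)]


mod = DatasetModulation(dim=3)
mod.add_source("olat")         # hypothetical label for the OLAT data
mod.add_source("light_stage")  # hypothetical label for the rendered scans

# Pretend training has moved one source's token away from identity.
mod.tokens["light_stage"].scale = [2.0, 2.0, 2.0]
mod.tokens["light_stage"].shift = [0.5, 0.5, 0.5]

shared = [0.5, 1.0, 1.5]  # features from the shared core network
olat_out = mod.modulate(shared, "olat")
scan_out = mod.modulate(shared, "light_stage")
```

In this toy setup the same shared features come out differently depending on the source token, so dataset-specific appearance can be absorbed by the tokens while the shared weights are left to model what the sources have in common.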