Generating photo-realistic video portraits from arbitrary speech audio is a crucial problem in film-making and virtual reality. Recently, several works have explored the use of neural radiance fields (NeRF) for this task to improve 3D realness and image fidelity. However, the generalizability of previous NeRF-based methods to out-of-domain audio is limited by the small scale of their training data. In this work, we propose GeneFace, a generalized and high-fidelity NeRF-based talking face generation method that can generate natural results for various out-of-domain audio. Specifically, we learn a variational motion generator on a large lip-reading corpus and introduce a domain adaptive post-net to calibrate the result. Moreover, we learn a NeRF-based renderer conditioned on the predicted facial motion. A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem. Extensive experiments show that our method achieves more generalized and higher-fidelity talking face generation than previous methods.