The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and of abundant, well-annotated multi-modal labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder that maps facial geometry and images to a highly generalized expression latent space, decoupling expressions from identities. Then, we use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, scan-level co-speech 3D facial animation dataset with well-annotated emotion and style labels. Finally, we propose Media2Face, a diffusion model in the GNPFA latent space for co-speech facial animation generation, which accepts rich multi-modal guidance from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.