GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance

Although existing speech-driven talking face generation methods have made significant progress, they remain far from real-world application because they require avatar-specific training and produce unstable lip movements. To address these issues, we propose GSmoothFace, a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model, which can synthesize smooth lip dynamics while preserving the speaker's identity. GSmoothFace consists mainly of the Audio to Expression Prediction (A2EP) module and the Target Adaptive Face Translation (TAFT) module. Specifically, we first develop the A2EP module to predict expression parameters synchronized with the driving speech. It uses a transformer to capture the long-term audio context and learns the parameters from the fine-grained 3D facial vertices, resulting in accurate and smooth lip synchronization. Afterward, the TAFT module, empowered by Morphology Augmented Face Blending (MAFB), takes the predicted expression parameters and the target video as inputs and modifies the facial region of the target video without distorting the background content. TAFT effectively exploits the identity appearance and background context in the target video, which makes it possible to generalize to different speakers without retraining. Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality. See the project page for code and data, and to request pre-trained models: https://zhanghm1995.github.io/GSmoothFace.
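The abstract does not spell out how MAFB operates. As a rough illustration only, one plausible reading is that a face mask is enlarged by a morphological operation and the generated face is composited into the target frame inside that enlarged region, so that background pixels are left untouched. The sketch below follows that assumption; the function names `dilate` and `blend_face`, the 3x3 structuring element, and the hard (non-feathered) composite are all hypothetical choices made for the example and are not taken from the paper.

```python
import numpy as np


def dilate(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary dilation of a 2D mask with a 3x3 square structuring element.

    Hypothetical helper: the paper does not state which morphological
    operation or kernel MAFB uses.
    """
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        # Pad with False so pixels outside the image never grow the mask.
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the nine shifted copies of the mask (3x3 neighbourhood).
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out


def blend_face(
    target: np.ndarray,
    generated: np.ndarray,
    face_mask: np.ndarray,
    iterations: int = 1,
) -> np.ndarray:
    """Composite `generated` into `target` inside the dilated face region.

    target, generated: (H, W, C) frames; face_mask: (H, W) boolean mask.
    Pixels outside the dilated mask are copied from `target` unchanged,
    which is the assumed mechanism for preserving the background.
    """
    region = dilate(face_mask, iterations)[..., None]  # (H, W, 1) for broadcasting
    return np.where(region, generated, target)
```

With a single-pixel mask in the centre of a 5x5 frame and one dilation step, only the central 3x3 block is taken from the generated frame; every other pixel comes from the target frame.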
