Achieving tight synchronization when synthesizing realistic, speech-driven talking head videos remains a significant challenge. Traditional Generative Adversarial Networks (GANs) struggle to maintain a consistent facial identity, while Neural Radiance Fields (NeRF) methods, although they can address this issue, often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires the coordinated synchronization of subject identity, lip movements, facial expressions, and head poses. The lack of such synchronization is a fundamental flaw that leads to unrealistic, artificial results. To address this critical issue of synchronization, which we identify as the "devil" in creating realistic talking heads, we introduce SyncTalk. This NeRF-based method effectively maintains subject identity while enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and uses a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses, yielding more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual result. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk