SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

Combining face swapping with lip synchronization offers a cost-effective solution for customized talking face generation. However, directly cascading existing models tends to introduce significant interference between the two tasks and to reduce video clarity, because the models can only interact in the low-level semantic RGB space. To address this issue, we propose a unified framework, SwapTalk, which performs both face swapping and lip synchronization in the same latent space. Following recent work on face generation, we choose the VQ-embedding space for its excellent editability and fidelity. To improve the framework's generalization to unseen identities, we incorporate an identity loss when training the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space when training the lip synchronization module to improve synchronization quality. For evaluation, previous studies focused primarily on self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we extend the evaluation to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric to assess more comprehensively how well identity is preserved across the frames of generated facial videos. Experimental results on the HDTF dataset demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency. Our demo is available at http://swaptalk.cc.
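
The abstract gives no formula for the identity loss or for the identity consistency metric. The following is a minimal sketch of one common way such quantities are computed, assuming each face is represented by an identity embedding vector produced by some pretrained face recognition network. The function names, the use of cosine similarity, and the choice to average over consecutive frames are illustrative assumptions, not the paper's definitions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_loss(source_emb: Sequence[float], swapped_emb: Sequence[float]) -> float:
    """Hypothetical identity loss: 1 - cos(source, swapped).

    Assumption: a cosine-distance form, which is a common choice in face
    swapping work; the abstract does not state the form actually used.
    """
    return 1.0 - cosine_similarity(source_emb, swapped_emb)


def temporal_identity_consistency(frame_embs: Sequence[Sequence[float]]) -> float:
    """Hypothetical consistency score: mean cosine similarity between the
    identity embeddings of consecutive frames (higher = more stable)."""
    if len(frame_embs) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine_similarity(prev, cur)
            for prev, cur in zip(frame_embs, frame_embs[1:])]
    return sum(sims) / len(sims)
```

Under these assumptions, a video whose per-frame embeddings barely change scores close to 1, while identity flicker between frames pulls the score down. A metric that compares every frame with the source identity, rather than with its neighbor, would be an equally plausible reading of the abstract.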
