VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping

Video face swapping is becoming increasingly popular across various applications, yet existing methods primarily focus on static images and struggle on video, where temporal consistency and complex scenarios pose additional challenges. In this paper, we present the first diffusion-based framework specifically designed for video face swapping. Our approach introduces a novel image-video hybrid training framework that leverages both abundant static image data and temporal video sequences, addressing the inherent limitations of video-only training. The framework pairs a specially designed diffusion model with a VidFaceVAE that processes both types of data, which helps maintain the temporal coherence of the generated videos. To further disentangle identity and pose features, we construct the Attribute-Identity Disentanglement Triplet (AIDT) Dataset, in which each triplet contains three face images: two share the same pose and two share the same identity. Combined with comprehensive occlusion augmentation, this dataset also improves robustness against occlusions. Additionally, we integrate 3D reconstruction techniques as input conditioning to our network to handle large pose variations. Extensive experiments demonstrate that our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods, while requiring fewer inference steps. Our approach effectively addresses key challenges in video face swapping, including temporal flickering, identity preservation, and robustness to occlusions and pose variations.
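
The triplet constraint described above (three face images, two sharing a pose and two sharing an identity) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `FaceSample` class, the `identity_id` and `pose_id` fields, the `is_valid_triplet` helper, and in particular the use of a discrete pose label, since real pose similarity would more likely be measured on continuous attributes. The sketch also assumes one reading of the constraint, namely that the identity-sharing pair and the pose-sharing pair are different pairs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class FaceSample:
    """One face image with hypothetical labels (illustrative, not from the paper)."""
    image_path: str
    identity_id: str  # assumed label for who the person is
    pose_id: str      # assumed discrete stand-in for pose similarity


def is_valid_triplet(a: FaceSample, b: FaceSample, c: FaceSample) -> bool:
    """Check that exactly one pair shares identity and exactly one
    different pair shares pose, under the assumptions stated above."""
    samples = (a, b, c)
    pairs = list(combinations(range(3), 2))  # index pairs (0,1), (0,2), (1,2)
    same_identity = [
        (i, j) for i, j in pairs
        if samples[i].identity_id == samples[j].identity_id
    ]
    same_pose = [
        (i, j) for i, j in pairs
        if samples[i].pose_id == samples[j].pose_id
    ]
    return (
        len(same_identity) == 1
        and len(same_pose) == 1
        and same_identity[0] != same_pose[0]
    )


# Usage: the first image shares identity with the second and pose with the third.
anchor = FaceSample("a.png", identity_id="id1", pose_id="poseP")
same_id = FaceSample("b.png", identity_id="id1", pose_id="poseQ")
same_pose = FaceSample("c.png", identity_id="id2", pose_id="poseP")
triplet_ok = is_valid_triplet(anchor, same_id, same_pose)
```

Under this reading, one image acts as a pivot that belongs to both pairs, so identity and pose vary independently across the other two images.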
