Face swapping aims to generate results that combine the identity of the source face with the attributes of the target. Existing methods primarily focus on image-based face swapping; when processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes such as lighting and makeup. To address these challenges, we propose a high-fidelity video face swapping (HiFiVFS) framework that leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled, fine-grained attribute features through identity desensitization and adversarial learning. Additionally, we introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance in video face swapping, both qualitatively and quantitatively.
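To make the "identity desensitization and adversarial learning" idea concrete, the following is a minimal sketch (not the authors' implementation) of how an attribute encoder can be trained adversarially so that an auxiliary identity classifier cannot recover the source identity from the extracted attribute features. All module names, shapes, and hyperparameters here are illustrative assumptions; the gradient-reversal formulation is only one common way to realize such an adversarial objective.

```python
# Minimal sketch of adversarial identity-disentanglement for attribute features.
# Assumptions: toy encoder/discriminator architectures, random data, PyTorch.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AttributeEncoder(nn.Module):
    """Toy convolutional encoder standing in for a fine-grained attribute extractor."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

class IdentityDiscriminator(nn.Module):
    """Adversary that tries to predict the identity label from attribute features."""
    def __init__(self, feat_dim=256, num_identities=1000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_identities)

    def forward(self, feat, lambd=1.0):
        return self.fc(GradReverse.apply(feat, lambd))

# One illustrative training step on random data.
encoder = AttributeEncoder()
id_disc = IdentityDiscriminator()
opt = torch.optim.Adam(list(encoder.parameters()) + list(id_disc.parameters()), lr=1e-4)

frames = torch.randn(4, 3, 128, 128)      # batch of target frames
id_labels = torch.randint(0, 1000, (4,))  # identity labels of those frames

attr_feat = encoder(frames)
id_logits = id_disc(attr_feat, lambd=1.0)
# The discriminator minimizes this loss; through gradient reversal the encoder
# maximizes it, i.e. it learns attribute features from which identity is hard to read out.
adv_loss = nn.functional.cross_entropy(id_logits, id_labels)
adv_loss.backward()
opt.step()
```

Under this kind of objective, the attribute features retain fine-grained cues (lighting, makeup, pose) while carrying little identity information, which is the disentanglement property the abstract attributes to the fine-grained attribute module.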