We present VFace, a training-free, plug-and-play method for high-quality face swapping in videos. It integrates seamlessly with image-based face-swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique that facilitates generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection, better aligning structural features from the target frame with the generated result. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typical of frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that it significantly improves temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
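To make the first component concrete, below is a minimal sketch of what a frequency-spectrum interpolation over attention feature maps could look like. It is an assumption-laden illustration, not the paper's implementation: the function name `spectral_interpolate`, the `cutoff_ratio` and `alpha` parameters, and the choice of which branch contributes which frequency band are all hypothetical; the sketch only assumes the general idea of blending two diffusion-layer feature maps in the 2D Fourier domain so that coarse (low-frequency) identity content and fine (high-frequency) target detail come from different branches.

```python
import torch
import torch.fft

def spectral_interpolate(feat_id, feat_tgt, cutoff_ratio=0.25, alpha=0.5):
    """Hypothetical frequency-domain blend of two attention feature maps.

    feat_id, feat_tgt: (B, C, H, W) tensors, e.g. identity-conditioned and
    target-conditioned attention outputs from the same diffusion layer.
    The low-frequency band (coarse identity/appearance) is interpolated;
    high frequencies (fine spatial detail) are kept from the target branch.
    """
    # 2D FFT of both feature maps, with DC component shifted to the center.
    Fi = torch.fft.fftshift(torch.fft.fft2(feat_id), dim=(-2, -1))
    Ft = torch.fft.fftshift(torch.fft.fft2(feat_tgt), dim=(-2, -1))

    # Boolean low-pass mask: radius <= cutoff_ratio in normalized coordinates.
    _, _, H, W = feat_id.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    low_pass = ((yy**2 + xx**2).sqrt() <= cutoff_ratio).to(feat_id.device)

    # Interpolate the low-frequency band; take high frequencies from target.
    blended = torch.where(low_pass, alpha * Fi + (1 - alpha) * Ft, Ft)

    # Back to the spatial domain; imaginary residue is numerical noise.
    out = torch.fft.ifft2(torch.fft.ifftshift(blended, dim=(-2, -1))).real
    return out
```

In such a scheme, `alpha` would trade identity preservation against fidelity to the target frame, and `cutoff_ratio` would control how much of the spectrum counts as "coarse"; the actual method, parameterization, and attention layers used should be taken from the released code at the repository above.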