Face swapping aims to generate realistic facial images by transferring the identity of a source face onto a target face while preserving the target's pose, expression, and context. However, existing methods, particularly GAN-based ones, often struggle to balance identity preservation and visual realism because of limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance, comprising gaze, identity, and facial parsing, through multi-scale cross-attention. Precomputed identity embeddings are injected into the denoising process via hierarchical attention layers, yielding accurate and consistent identity transfer. To improve semantic coherence and visual quality, we apply expert-guided supervision through facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73 (lower is better), outperforming established baselines such as FaceShifter and MegaFS. Qualitative results also show improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.
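As a rough illustration of what conditioning a denoiser on identity through multi-scale cross-attention can look like, the sketch below lets flattened spatial features at two resolutions attend to a small set of conditioning tokens. It is not the authors' implementation: the function names, the dimensions, the random projection weights, and the use of three tokens (standing in for identity, gaze, and parsing features) are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(features, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection.

    features:    (N, C) flattened spatial features (queries)
    cond_tokens: (M, D) conditioning tokens (keys and values)
    """
    q = features @ w_q                      # (N, A)
    k = cond_tokens @ w_k                   # (M, A)
    v = cond_tokens @ w_v                   # (M, C)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)         # (N, M), one row per location
    return features + attn @ v, attn


def run_scales(sides=(4, 8), C=8, D=16, A=4, M=3, seed=0):
    """Apply cross-attention at several resolutions (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    cond = rng.normal(size=(M, D))          # assumed precomputed embeddings
    results = {}
    for side in sides:
        # Each scale gets its own projections, as a multi-scale design would.
        w_q = rng.normal(size=(C, A))
        w_k = rng.normal(size=(D, A))
        w_v = rng.normal(size=(D, C))
        feats = rng.normal(size=(side * side, C))
        results[side] = cross_attention(feats, cond, w_q, w_k, w_v)
    return results


if __name__ == "__main__":
    for side, (out, attn) in run_scales().items():
        print(side, out.shape, attn.shape)
```

Because each spatial location computes its own attention weights over the conditioning tokens, the amount of identity information mixed in can vary by region, which is one plausible reading of the "spatially adaptive identity alignment" described above.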