Multimodal-driven talking face generation animates a portrait with a given pose, expression, and gaze, either transferred from a driving image or video or estimated from text and audio. However, existing methods ignore the potential of the text modality, and their generators mainly follow a source-oriented feature-rearrangement paradigm coupled with unstable GAN frameworks. In this work, we first represent emotion in the text prompt, which inherits rich semantics from CLIP and allows flexible, generalized emotion control. We further reorganize these tasks as target-oriented texture transfer and adopt Diffusion Models. More specifically, given a textured face as the source and a rendered face projected from the desired 3DMM coefficients as the target, our proposed Texture-Geometry-aware Diffusion Model (TGDM) decomposes the complex transfer problem into a multi-conditional denoising process, in which a Texture Attention-based module accurately models the correspondences between the appearance and geometry cues contained in the source and target conditions and incorporates extra implicit information for high-fidelity talking face generation. Additionally, TGDM can be gracefully tailored for face swapping. We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes. Extensive experiments demonstrate the superiority of our method.
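To make the idea of matching target geometry to source texture more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which tokens from the rendered target face act as queries and tokens from the textured source face act as keys and values. The abstract does not specify the internals of the Texture Attention-based module, so everything here is an assumption for illustration: the function name `texture_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the token counts, and the feature sizes are all hypothetical, and random arrays stand in for real features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def texture_cross_attention(target_geom, source_tex, w_q, w_k, w_v):
    """Hypothetical cross-attention: geometry tokens query texture tokens.

    target_geom: (n_t, c) features of the rendered target face (geometry cues)
    source_tex:  (n_s, c) features of the textured source face (appearance cues)
    w_q, w_k, w_v: (c, d) projection matrices (assumed, not from the paper)
    """
    q = target_geom @ w_q                      # (n_t, d) queries from geometry
    k = source_tex @ w_k                       # (n_s, d) keys from texture
    v = source_tex @ w_v                       # (n_s, d) values from texture
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_t, n_s) similarity scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights                # texture gathered per target token


# Random stand-in features; real inputs would come from learned encoders.
rng = np.random.default_rng(0)
n_t, n_s, c, d = 16, 16, 8, 4
target_geom = rng.standard_normal((n_t, c))
source_tex = rng.standard_normal((n_s, c))
w_q, w_k, w_v = (rng.standard_normal((c, d)) for _ in range(3))

out, weights = texture_cross_attention(target_geom, source_tex, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

In a diffusion setting, an output like `out` would typically be one of several conditions fed to the denoising network at each step; how TGDM actually combines its conditions is not stated in the abstract and is not modeled here.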