Multimodal-driven talking face generation refers to animating a portrait with a given pose, expression, and gaze, either transferred from a driving image or video or estimated from text or audio. However, existing methods ignore the potential of the text modality, and their generators mainly follow a source-oriented feature rearrangement paradigm coupled with unstable GAN frameworks. In this work, we first represent emotion through a text prompt, which inherits rich semantics from CLIP and allows flexible, generalized emotion control. We further reformulate these tasks as target-oriented texture transfer and adopt diffusion models. More specifically, given a textured face as the source and the rendered face projected from the desired 3DMM coefficients as the target, our proposed Texture-Geometry-aware Diffusion Model (TGDM) decomposes the complex transfer problem into a multi-conditional denoising process, in which a Texture Attention-based module accurately models the correspondences between the appearance and geometry cues contained in the source and target conditions and incorporates extra implicit information for high-fidelity talking face generation. Additionally, TGDM can be gracefully tailored for face swapping. We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes. Extensive experiments demonstrate the superiority of our method.
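To make the idea of matching source appearance to target geometry more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in NumPy. It is an illustration under stated assumptions, not the Texture Attention-based module itself: the function name `texture_cross_attention`, the feature shapes, and the absence of learned projections are all hypothetical choices made for brevity. Queries come from target (geometry) features, while keys and values come from source (texture) features, so each target location gathers a weighted mixture of source appearance.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def texture_cross_attention(target_geom, source_tex):
    """Hypothetical helper: attend from target geometry to source texture.

    target_geom: array of shape (N_t, d), one feature per target location.
    source_tex:  array of shape (N_s, d), one feature per source location.
    Returns the transferred features (N_t, d) and the attention weights (N_t, N_s).
    """
    d = target_geom.shape[-1]
    # Similarity between every target location and every source location.
    scores = target_geom @ source_tex.T / np.sqrt(d)
    # Each target location gets a distribution over source locations.
    weights = softmax(scores, axis=-1)
    # Weighted sum of source texture features per target location.
    transferred = weights @ source_tex
    return transferred, weights


# Random stand-ins for flattened feature maps (assumed sizes).
rng = np.random.default_rng(0)
target_geom = rng.normal(size=(16, 8))
source_tex = rng.normal(size=(32, 8))

transferred, weights = texture_cross_attention(target_geom, source_tex)
print(transferred.shape, weights.shape)
```

A practical module would typically add learned query, key, and value projections and multiple heads, and would run inside each denoising step with the other conditions; those details are omitted here because the abstract does not specify them.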