FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model

Talking head generation is a significant research topic that still faces numerous challenges. Previous works often adopt generative adversarial networks or regression models, which suffer from limited generation quality and the average-facial-shape problem. Although diffusion models show impressive generative ability, their use in talking head generation remains unsatisfactory. Existing approaches either use the diffusion model only to obtain an intermediate representation and then rely on a separate pre-trained renderer, or they overlook the feature decoupling of complex facial details such as expressions, head poses, and appearance textures. We therefore propose FD2Talk, a Facial Decoupled Diffusion model for Talking head generation, which fully leverages the advantages of diffusion models and decouples complex facial details across multiple stages. Specifically, we separate facial details into motion and appearance. In the first stage, we design a Diffusion Transformer to accurately predict motion coefficients from raw audio. These motions are highly decoupled from appearance, which makes them easier for the network to learn than high-dimensional RGB images. In the second stage, we encode the reference image to capture appearance textures. The predicted facial and head motions and the encoded appearance then serve as conditions for a Diffusion UNet, guiding frame generation. By decoupling facial details and fully leveraging diffusion models, our approach improves image quality and generates more accurate and diverse results than previous state-of-the-art methods, as substantiated by extensive experiments.
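To make the two-stage data flow in the abstract concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names, tensor sizes, step count, and the `toy_denoise` update are all hypothetical placeholders standing in for the learned Diffusion Transformer, appearance encoder, and Diffusion UNet. It only illustrates which signal conditions which stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from the paper.
NUM_FRAMES = 8           # frames to generate
AUDIO_DIM = 16           # per-frame audio feature size
MOTION_DIM = 6           # per-frame motion coefficients (expression + head pose)
APPEAR_DIM = 32          # appearance embedding size
IMG_SHAPE = (3, 16, 16)  # tiny RGB frame (channels, height, width)
STEPS = 4                # reverse steps; real samplers use many more


def toy_denoise(x, cond, w):
    """Hypothetical stand-in for a learned denoiser.

    Moves x halfway toward a linear function of the conditioning signal.
    """
    target = (cond @ w).reshape(x.shape)
    return x + 0.5 * (target - x)


def stage1_motion(audio, w_motion):
    """Stage 1: denoise motion coefficients from noise, conditioned on audio."""
    motion = rng.standard_normal((audio.shape[0], MOTION_DIM))
    for _ in range(STEPS):
        motion = toy_denoise(motion, audio, w_motion)
    return motion


def encode_appearance(reference_image, w_enc):
    """Encode the reference image into one appearance vector (toy linear encoder)."""
    return reference_image.reshape(-1) @ w_enc


def stage2_frames(motion, appearance, w_frame):
    """Stage 2: denoise RGB frames, conditioned on motion and appearance."""
    num_frames = motion.shape[0]
    # Each frame sees its own motion plus the shared appearance embedding.
    cond = np.concatenate(
        [motion, np.tile(appearance, (num_frames, 1))], axis=1
    )
    frames = rng.standard_normal((num_frames, *IMG_SHAPE))
    for _ in range(STEPS):
        frames = toy_denoise(frames, cond, w_frame)
    return frames


def generate(audio, reference_image, weights):
    """Run both stages: audio -> motion, then (motion, appearance) -> frames."""
    motion = stage1_motion(audio, weights["motion"])
    appearance = encode_appearance(reference_image, weights["encoder"])
    frames = stage2_frames(motion, appearance, weights["frame"])
    return motion, frames


# Random weights and inputs stand in for trained parameters and real data.
pixels = int(np.prod(IMG_SHAPE))
weights = {
    "motion": rng.standard_normal((AUDIO_DIM, MOTION_DIM)) * 0.1,
    "encoder": rng.standard_normal((pixels, APPEAR_DIM)) * 0.1,
    "frame": rng.standard_normal((MOTION_DIM + APPEAR_DIM, pixels)) * 0.1,
}
audio = rng.standard_normal((NUM_FRAMES, AUDIO_DIM))
reference_image = rng.standard_normal(IMG_SHAPE)

motion, frames = generate(audio, reference_image, weights)
print(motion.shape, frames.shape)
```

The point of the sketch is the separation of concerns: stage 1 never sees pixels, so it works in a low-dimensional motion space, while stage 2 receives appearance only through the encoded reference image and motion only through the stage 1 output.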
