Talking head generation is a significant research topic that still faces numerous challenges. Previous works often adopt generative adversarial networks or regression models, which suffer from limited generation quality and the average facial shape problem. Although diffusion models show impressive generative ability, their application to talking head generation remains unsatisfactory. Existing diffusion-based methods either use the diffusion model only to obtain an intermediate representation and then rely on a separate pre-trained renderer, or overlook the feature decoupling of complex facial details, such as expressions, head poses and appearance textures. Therefore, we propose a Facial Decoupled Diffusion model for Talking head generation, called FD2Talk, which fully leverages the advantages of diffusion models and decouples the complex facial details across multiple stages. Specifically, we separate facial details into motion and appearance. In the first phase, we design a Diffusion Transformer to accurately predict motion coefficients from raw audio. These motions are highly decoupled from appearance, making them easier for the network to learn than high-dimensional RGB images. In the second phase, we encode the reference image to capture appearance textures. The predicted facial and head motions and the encoded appearance then serve as conditions for a Diffusion UNet, guiding frame generation. By decoupling facial details and fully leveraging diffusion models, our approach improves image quality and generates more accurate and diverse results than previous state-of-the-art methods, as extensive experiments show.
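To make the two-phase data flow concrete, the following is a minimal sketch and not the authors' implementation. It assumes standard DDPM ancestral sampling with a linear noise schedule for both phases, which the abstract does not specify. The two denoisers are untrained toy functions standing in for the Diffusion Transformer and the Diffusion UNet, and every name and dimension (`MOTION_DIM`, `toy_motion_denoiser`, `toy_frame_denoiser`, `encode_appearance`, `generate_talking_head`) is hypothetical. The sketch only illustrates which quantity conditions which phase.

```python
import numpy as np

MOTION_DIM = 64  # hypothetical size of the motion-coefficient vector per frame


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the abstract names no schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def ddpm_sample(denoiser, shape, cond, rng, num_steps=50):
    """Generic DDPM ancestral sampling: start from noise, denoise under `cond`."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(x, t, cond)  # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is added at the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x


def toy_motion_denoiser(x, t, audio_feats):
    """Untrained stand-in for the Diffusion Transformer (phase 1)."""
    # x: (T, MOTION_DIM); audio_feats: (T, audio_dim)
    return 0.1 * x + 0.01 * audio_feats.mean(axis=-1, keepdims=True)


def encode_appearance(reference_image):
    """Toy appearance encoder: per-channel mean of the reference image."""
    return reference_image.mean(axis=(0, 1))  # shape (3,)


def toy_frame_denoiser(x, t, cond):
    """Untrained stand-in for the Diffusion UNet (phase 2)."""
    motion, appearance = cond
    # x: (T, H, W, 3); one scalar bias per frame from motion plus appearance.
    bias = motion.mean(axis=-1)[:, None, None, None] + appearance.mean()
    return 0.1 * x + 0.01 * bias


def generate_talking_head(audio_feats, reference_image, seed=0):
    """Phase 1: audio -> motion. Phase 2: (motion, appearance) -> frames."""
    rng = np.random.default_rng(seed)
    num_frames = audio_feats.shape[0]
    motion = ddpm_sample(toy_motion_denoiser, (num_frames, MOTION_DIM), audio_feats, rng)
    appearance = encode_appearance(reference_image)
    height, width, _ = reference_image.shape
    frames = ddpm_sample(
        toy_frame_denoiser, (num_frames, height, width, 3), (motion, appearance), rng
    )
    return motion, frames


if __name__ == "__main__":
    audio = np.zeros((4, 8))   # 4 frames of 8-dim audio features (placeholder)
    ref = np.ones((6, 5, 3))   # tiny placeholder reference image
    motion, frames = generate_talking_head(audio, ref)
    print(motion.shape, frames.shape)
```

Note the design point the sketch mirrors: the first sampler operates on a low-dimensional motion vector per frame, while only the second sampler touches image-sized tensors, and appearance enters solely as a condition of the second phase.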