Diffusion models have shown remarkable success in a variety of downstream generative tasks, yet remain under-explored in the important and challenging task of expressive talking head generation. In this work, we propose the DreamTalk framework to fill this gap, employing a meticulous design to unlock the potential of diffusion models in generating expressive talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network consistently synthesizes high-quality audio-driven face motions across diverse expressions. To enhance the expressiveness and accuracy of lip motions, we introduce a style-aware lip expert that guides lip-sync while remaining mindful of the speaking style. To eliminate the need for an expression reference video or text, an additional diffusion-based style predictor predicts the target expression directly from the audio. In this way, DreamTalk can harness powerful diffusion models to generate expressive faces effectively and reduce the reliance on expensive style references. Experimental results demonstrate that DreamTalk is capable of generating photo-realistic talking faces with diverse speaking styles and achieving accurate lip motions, surpassing existing state-of-the-art counterparts.