Speech-driven 3D facial animation has attracted attention in both academia and industry. Traditional methods mostly learn a deterministic mapping from speech to animation. Recent approaches have begun to account for the non-deterministic nature of speech-driven 3D facial animation and employ diffusion models for the task. However, personalizing facial animation and accelerating animation generation remain open problems for existing diffusion-based methods. To address these limitations, we propose DiffusionTalker, a diffusion-based method that uses contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity that aggregates knowledge from audio sequences. The proposed identity embeddings extract customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation from input audio that reflects a specific talking style. For acceleration, we distill a trained diffusion model with hundreds of steps into a lightweight model with 8 steps. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code will be released.
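The abstract does not state which contrastive objective is used to match identity embeddings with audio. As a minimal sketch, assuming a symmetric InfoNCE loss over matched (identity, audio) pairs, the idea could look as follows. The function names, the temperature value, and the pairing scheme are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(identity_embs, audio_feats, temperature=0.07):
    """Symmetric InfoNCE loss (assumed objective, not confirmed by the paper).

    identity_embs[i] and audio_feats[i] are treated as a matched pair;
    every other combination in the batch is treated as a negative.
    """
    ids = [l2_normalize(v) for v in identity_embs]
    auds = [l2_normalize(v) for v in audio_feats]
    n = len(ids)

    # logits[i][j]: temperature-scaled cosine similarity of identity i and audio j
    logits = [
        [sum(a * b for a, b in zip(ids[i], auds[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the identity-to-audio and audio-to-identity directions.
    cols = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))
```

Under this assumed objective, the loss is small when each identity embedding is most similar to its own speaker's audio features and large when the pairs are mismatched, which is the behavior a learnable talking identity would need in order to capture speaker-specific cues.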