Speech-driven 3D facial animation has attracted attention in both academia and industry. Traditional methods mostly learn a deterministic mapping from speech to animation. Recent approaches have begun to account for the non-deterministic nature of speech-driven 3D facial animation and employ diffusion models for the task. However, existing diffusion-based methods remain limited in two respects: personalization of the facial animation and the speed of animation generation. To address these limitations, we propose DiffusionTalker, a diffusion-based method that uses contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity that aggregates knowledge from audio sequences. The proposed identity embeddings are trained contrastively to extract customized facial cues across different people. During inference, users can obtain personalized facial animation that reflects a specific talking style from the input audio. For acceleration, we distill a trained diffusion model with hundreds of steps into a lightweight model with 8 steps. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code will be released.
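To make the contrastive personalization idea concrete, the following is a minimal sketch of a symmetric InfoNCE-style objective that pulls each identity embedding toward the audio feature of the same speaker and pushes it away from the others. The abstract does not specify the loss, so the formulation, the function names (`contrastive_loss`, `cosine_logits`), the temperature of 0.07, and the toy 2-D vectors are all assumptions made for illustration, not the method's actual implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(identities, audio_feats, temperature):
    """Pairwise cosine similarity between identities (rows) and audio (columns)."""
    ids = [l2_normalize(v) for v in identities]
    auds = [l2_normalize(v) for v in audio_feats]
    return [
        [sum(a * b for a, b in zip(i, u)) / temperature for u in auds]
        for i in ids
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def contrastive_loss(identities, audio_feats, temperature=0.07):
    """Symmetric loss: identity-to-audio plus audio-to-identity, averaged.

    Assumption: pair k in `identities` belongs to pair k in `audio_feats`.
    """
    logits = cosine_logits(identities, audio_feats, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)
    )


# Toy data: two hypothetical speakers with 2-D embeddings.
identities = [[1.0, 0.0], [0.0, 1.0]]
matched_audio = [[0.9, 0.1], [0.1, 0.9]]   # audio k resembles identity k
swapped_audio = [[0.1, 0.9], [0.9, 0.1]]   # audio k resembles the other identity

loss_matched = contrastive_loss(identities, matched_audio)
loss_swapped = contrastive_loss(identities, swapped_audio)
```

In this toy setup the loss is small when each audio feature aligns with its own identity and large when the pairing is swapped, which is the behavior a contrastive objective is meant to encourage.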