Audio-driven 3D facial animation synthesis has been an active field of research, attracting attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. This is mainly due to the lack of emotionally rich facial animation data and of algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of models are deterministic: given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial for generating diverse and emotionally rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion-controllable, speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and the emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and through a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs, and use both in-the-wild and ground-truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method that incorporates a rich emotion dataset and supports emotion control via emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for quality judgment. The entire codebase is publicly available (https://github.com/uuembodiedsocialai/ProbTalk3D/).