FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning

This paper presents FaceXHuBERT, a text-less, speech-driven 3D facial animation generation method that captures personalized and subtle cues in speech (e.g. identity, emotion and hesitation). It is also robust to background noise and can handle audio recorded in a variety of situations (e.g. multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate animation for the whole face. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck, and the resulting animations still have issues with accurate lip-syncing, expressivity, person-specific information and generalizability. We employ the self-supervised pretrained HuBERT model in the training process, which allows us to incorporate both lexical and non-lexical information from the audio without using a large lexicon. Additionally, guiding the training with a binary emotion condition and speaker identity helps the model distinguish even very subtle facial motion. We carried out extensive objective and subjective evaluations against ground truth and state-of-the-art work. A perceptual user study shows that our approach produces more realistic animation than the state of the art 78% of the time. In addition, our method is 4 times faster, as it eliminates complex sequential models such as transformers. We strongly recommend watching the supplementary video before reading the paper. We also provide the implementation and evaluation code via a GitHub repository link.
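
The conditioning idea described above can be illustrated with a minimal sketch. The abstract does not say how the binary emotion condition and speaker identity are injected into the model, so the concatenation used here is only an assumption, and the names `one_hot` and `condition_frames` are hypothetical. The per-frame audio features are stand-in numbers rather than real HuBERT outputs.

```python
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    return [1.0 if i == index else 0.0 for i in range(size)]


def condition_frames(
    audio_features: Sequence[Sequence[float]],
    speaker_index: int,
    num_speakers: int,
    expressive: bool,
) -> List[List[float]]:
    """Append a speaker one-hot vector and a binary emotion flag to every frame.

    Assumption: conditions are concatenated to each frame. The source only
    states that training is guided by these two conditions.
    """
    condition = one_hot(speaker_index, num_speakers) + [1.0 if expressive else 0.0]
    return [list(frame) + condition for frame in audio_features]


# Two stand-in audio frames with two features each (not real HuBERT output).
frames = [[0.1, 0.2], [0.3, 0.4]]
conditioned = condition_frames(frames, speaker_index=1, num_speakers=3, expressive=True)
print(conditioned[0])  # [0.1, 0.2, 0.0, 1.0, 0.0, 1.0]
```

Each conditioned frame carries the same identity and emotion information, so a downstream decoder could in principle produce different facial motion for the same audio when the speaker or the emotion flag changes.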
