Lifelike speech-driven 3D facial animation requires natural and precise synchronization between the audio input and facial expressions. However, existing methods still fail to render shapes with flexible head poses and natural facial details (e.g., wrinkles). This limitation stems mainly from two factors: 1) Collecting a training set with detailed 3D facial shapes is highly expensive, and the resulting scarcity of detailed shape annotations hinders the training of models that produce expressive facial animation. 2) Compared to mouth movement, head pose is much less correlated with speech content, so modeling mouth movement and head pose jointly leads to a lack of control over facial movement. To address these challenges, we introduce VividTalker, a new framework for speech-driven 3D facial animation with flexible head pose and natural facial details. Specifically, we explicitly disentangle facial animation into head pose and mouth movement and encode them separately into discrete latent spaces. These attributes are then generated through an autoregressive process that leverages a window-based Transformer architecture. To enrich the 3D facial animation, we construct a new 3D dataset with detailed shapes and learn to synthesize facial details in line with speech content. Extensive quantitative and qualitative experiments demonstrate that VividTalker outperforms state-of-the-art methods, producing vivid and realistic speech-driven 3D facial animation.
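The idea of encoding head pose and mouth movement separately into discrete latent spaces can be illustrated with a minimal sketch. It assumes a nearest-neighbor codebook lookup in the style of vector quantization, which the abstract does not confirm; the codebooks, feature dimensions, and the names `nearest_code` and `tokenize` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each per-frame feature vector to a discrete token index."""
    return [nearest_code(frame, codebook) for frame in frames]


# Hypothetical toy codebooks (assumption): one per attribute, so that head pose
# and mouth movement live in separate discrete latent spaces.
POSE_CODEBOOK = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.0, 0.3, 0.0)]
MOUTH_CODEBOOK = [(0.0, 0.0), (1.0, 0.2), (0.5, 0.8)]

# Hypothetical per-frame features (assumption): 3 values for pose, 2 for mouth.
pose_frames = [(0.05, 0.0, 0.0), (0.25, 0.05, 0.0), (0.0, 0.28, 0.02)]
mouth_frames = [(0.1, 0.0), (0.9, 0.3), (0.45, 0.7)]

# Each attribute is tokenized independently with its own codebook.
pose_tokens = tokenize(pose_frames, POSE_CODEBOOK)
mouth_tokens = tokenize(mouth_frames, MOUTH_CODEBOOK)
```

Keeping two token streams lets a downstream autoregressive model predict each attribute from its own vocabulary, which is one way the weakly speech-correlated head pose can be controlled independently of the mouth movement.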