Speech-driven 3D facial animation synthesis has been a challenging task in both industry and research. Most recent methods are deterministic deep learning models: given a speech input, the output is always the same. In reality, however, the non-verbal facial cues that reside throughout the face are non-deterministic in nature. In addition, the majority of approaches focus on 3D vertex-based datasets, and methods that are compatible with existing facial animation pipelines with rigged characters are scarce. To address these issues, we present FaceDiffuser, a non-deterministic deep learning model that generates speech-driven facial animations and is trained with both 3D vertex-based and blendshape-based datasets. Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input. To the best of our knowledge, we are the first to employ the diffusion method for the task of speech-driven 3D facial animation synthesis. We conduct extensive objective and subjective analyses and show that our approach achieves better or comparable results relative to state-of-the-art methods. We also introduce a new in-house dataset based on a blendshape-based rigged character. We recommend watching the accompanying supplementary video. The code and the dataset will be made publicly available.
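To illustrate why a diffusion-based model is non-deterministic, the following is a minimal sketch of the generic DDPM-style forward noising step, not the FaceDiffuser implementation. The linear schedule, the number of steps, the function names, and the four-value frame `x0` are all assumptions chosen for illustration; audio encoding with HuBERT and the learned denoising network are omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    return [signal * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(1000)
alpha_bar = cumulative_alpha(betas)

# Hypothetical single animation frame: four blendshape weights.
x0 = [0.1, -0.2, 0.05, 0.0]

rng = random.Random(0)
x_a = forward_noise(x0, 500, alpha_bar, rng)
x_b = forward_noise(x0, 500, alpha_bar, rng)  # same input, fresh noise
```

Because each call draws fresh Gaussian noise, `x_a` and `x_b` differ even though the input frame is identical. A trained denoiser conditioned on speech features would reverse this process, so repeated sampling for the same audio can yield different plausible animations.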