Speech-driven 3D facial animation has improved considerably in recent years, yet most existing work relies on the acoustic modality alone and neglects visual and textual cues, which leads to unsatisfactory precision and coherence. We argue that visual and textual cues carry non-trivial information. We therefore present PMMTalk, a novel framework that uses complementary Pseudo Multi-Modal features to improve the accuracy of facial animation. The framework comprises three modules: the PMMTalk encoder, a cross-modal alignment module, and the PMMTalk decoder. Specifically, the PMMTalk encoder employs an off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. The cross-modal alignment module then aligns the audio, image, and text features at the temporal and semantic levels. Finally, the PMMTalk decoder predicts lip-synced facial blendshape coefficients. Compared with prior methods, PMMTalk requires only one additional input, a random reference face image, yet yields more accurate results. It is also artist-friendly: because it produces facial blendshape coefficients, it integrates seamlessly into standard animation production workflows. In addition, given the scarcity of 3D talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. We recommend watching the supplementary video.
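As a rough illustration of the data flow described above (per-modality features, temporal alignment, decoding to blendshape coefficients), the following is a minimal sketch and not the PMMTalk implementation. Every name, feature dimension, and the count of 52 blendshapes are assumptions made for the example. Linear interpolation stands in for temporal alignment, concatenation stands in for fusion, and a single random linear layer with a sigmoid stands in for the decoder. Semantic-level alignment is not modeled here.

```python
import numpy as np

NUM_BLENDSHAPES = 52  # assumption: the abstract does not state the coefficient count


def resample_to_frames(features: np.ndarray, num_frames: int) -> np.ndarray:
    """Linearly interpolate a (T, D) feature sequence to (num_frames, D)."""
    src_len, dim = features.shape
    src_pos = np.linspace(0.0, 1.0, src_len)
    dst_pos = np.linspace(0.0, 1.0, num_frames)
    columns = [np.interp(dst_pos, src_pos, features[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)


def align_and_decode(audio, image, text, num_frames, weights):
    """Hypothetical stand-in for temporal alignment followed by decoding.

    audio, image, text: (T_a, D_a), (T_i, D_i), (T_t, D_t) feature sequences.
    weights: (D_a + D_i + D_t, NUM_BLENDSHAPES) projection matrix.
    Returns per-frame coefficients of shape (num_frames, NUM_BLENDSHAPES).
    """
    # Bring all three streams onto the same animation frame grid.
    aligned = [resample_to_frames(f, num_frames) for f in (audio, image, text)]
    # Fuse by concatenating along the feature axis (a placeholder for real fusion).
    fused = np.concatenate(aligned, axis=1)
    # Project to coefficients; the sigmoid keeps them in (0, 1), an assumed convention.
    return 1.0 / (1.0 + np.exp(-(fused @ weights)))


rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(100, 8))  # e.g. dense acoustic frames
image_feats = rng.normal(size=(25, 4))   # e.g. features from generated video frames
text_feats = rng.normal(size=(12, 6))    # e.g. token-level text features
decoder_weights = rng.normal(size=(8 + 4 + 6, NUM_BLENDSHAPES))

coeffs = align_and_decode(audio_feats, image_feats, text_feats, 30, decoder_weights)
print(coeffs.shape)
```

The sketch only shows why a common frame grid is needed before the three streams can be combined: each modality arrives at a different temporal resolution, and the decoder needs one fused vector per animation frame.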