Speech-driven 3D face animation is finding applications across a growing range of multimedia fields. Previous research has produced promising, realistic lip movements and facial expressions from audio signals. However, traditional regression models driven solely by data face several fundamental problems, such as the difficulty of obtaining precise labels and the domain gaps between modalities, which lead to unsatisfactory results that lack precision and coherence. To improve the visual accuracy of generated lip movements while reducing the dependence on labeled data, we propose SelfTalk, a novel framework that introduces self-supervision into a cross-modal network system for learning 3D talking faces. The framework consists of three modules: a facial animator, a speech recognizer, and a lip-reading interpreter. The core of SelfTalk is a commutative training diagram that facilitates the exchange of compatible features among audio, text, and lip shape, enabling our models to learn the intricate connections between these modalities. The framework leverages the knowledge learned by the lip-reading interpreter to generate more plausible lip shapes. Extensive experiments and user studies demonstrate that our approach achieves state-of-the-art performance both qualitatively and quantitatively. We recommend watching the supplementary video.
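The abstract does not spell out the training objective, so the following is only a rough sketch of what a commutative training diagram over audio, text, and lip shape could look like. It assumes, purely for illustration, that each of the three modules is a single linear map, and that consistency is measured as the mean squared error between the two paths from audio to text features. The feature sizes, the helper names (`linear`, `forward`, `mse`, `consistency_loss`), and the loss itself are hypothetical and are not taken from the SelfTalk implementation.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random out_dim x in_dim weight matrix (a stand-in for a trained network)."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def forward(weights, frames):
    """Apply one weight matrix to every frame of a sequence."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]


def mse(a, b):
    """Mean squared error between two equally shaped frame sequences."""
    diffs = [(x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
    return sum(diffs) / len(diffs)


def consistency_loss(audio_frames, animator, recognizer, reader):
    """Gap between the two paths from audio to text features.

    Path 1: audio -> facial animator -> lip shape -> lip reader -> text
    Path 2: audio -> speech recognizer -> text
    The diagram commutes when both paths agree, i.e. the loss is zero.
    """
    lips = forward(animator, audio_frames)
    text_via_lips = forward(reader, lips)
    text_via_audio = forward(recognizer, audio_frames)
    return mse(text_via_lips, text_via_audio)


rng = random.Random(0)
AUDIO_DIM, LIP_DIM, TEXT_DIM = 8, 6, 4  # hypothetical feature sizes

facial_animator = linear(AUDIO_DIM, LIP_DIM, rng)     # audio -> lip shape
speech_recognizer = linear(AUDIO_DIM, TEXT_DIM, rng)  # audio -> text features
lip_reader = linear(LIP_DIM, TEXT_DIM, rng)           # lip shape -> text features

# Five frames of made-up audio features.
audio = [[rng.uniform(-1, 1) for _ in range(AUDIO_DIM)] for _ in range(5)]
loss = consistency_loss(audio, facial_animator, speech_recognizer, lip_reader)
print(f"consistency loss with untrained modules: {loss:.4f}")
```

In this toy setting, minimizing the loss with respect to the facial animator pushes the generated lip shapes toward ones the lip reader can interpret as the same text the speech recognizer extracts from the audio, which is one way to read the claim that the lip-reading interpreter's knowledge guides lip-shape generation without additional labels.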