The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field lies in blending speech-related motions with expression dynamics, a difficulty primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. While previous works have attempted to exploit 2D video data and parametric 3D models as a workaround, they still show limitations in jointly modeling the two motions. In this work, we address this problem from a different perspective and propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads with a set of 3D expressive sequences. To demonstrate the advantages of this approach and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate audio-synchronized lip movements together with the expressive traits of the face. Comprehensive quantitative and qualitative experiments using our data and generator demonstrate a superior ability to synthesize convincing animations compared with the best-performing methods in the literature. Our code and pre-trained model will be made available.
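To make the generator's input/output contract concrete, the following is a minimal interface sketch in Python. All names, tensor shapes, the emotion vocabulary, and default rates are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

# Hypothetical interface for the emotional 3D talking head generator
# described above; shapes and labels are assumptions for illustration.
EMOTIONS = ("happy", "sad", "angry", "surprised", "disgusted", "fearful")

def animate_talking_head(template_vertices: np.ndarray,  # (V, 3) neutral 3D face mesh
                         audio_waveform: np.ndarray,     # (T,) mono audio samples
                         emotion: str,                   # label drawn from EMOTIONS
                         intensity: float,               # expression strength in [0, 1]
                         fps: int = 30,
                         sample_rate: int = 16000) -> np.ndarray:
    """Return an animated mesh sequence of shape (F, V, 3), one entry per
    video frame, with lip motion synchronized to the audio and the chosen
    expression applied at the given intensity."""
    assert emotion in EMOTIONS and 0.0 <= intensity <= 1.0
    n_frames = int(len(audio_waveform) / sample_rate * fps)
    # Placeholder: a trained model would predict per-frame vertex
    # displacements here, conditioned on audio, emotion, and intensity.
    displacements = np.zeros((n_frames, *template_vertices.shape))
    return template_vertices[None] + displacements
```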