Speech-driven 3D face animation aims to generate realistic facial expressions that match both the content and the emotion of the speech. However, existing methods often neglect emotional facial expressions or fail to disentangle them from speech content. To address this issue, this paper proposes an end-to-end neural network that disentangles the emotions in speech in order to generate rich 3D facial expressions. Specifically, we introduce an emotion disentangling encoder (EDE) that separates emotion from content by cross-reconstructing speech signals with different emotion labels. An emotion-guided feature fusion decoder then generates a 3D talking face with enhanced emotion. The decoder is driven by the disentangled identity, emotion, and content embeddings, so that personal and emotional styles can be controlled. Finally, given the scarcity of 3D emotional talking face data, we use facial blendshapes as supervision, which enables the reconstruction of plausible 3D faces from 2D emotional data, and we contribute a large-scale 3D emotional talking face dataset (3D-ETF) to train the network. Our experiments and user studies show that our approach outperforms state-of-the-art methods and produces more diverse facial movements. We recommend watching the supplementary video: https://ziqiaopeng.github.io/emotalk
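The cross-reconstruction idea behind the EDE can be illustrated with a toy sketch. This is not the paper's implementation: the real encoders and decoder are neural networks trained on speech, whereas here each feature vector is assumed to be a plain concatenation of a content part and an emotion part, and the hypothetical functions `encode_content`, `encode_emotion`, and `decode` are simple slicing and concatenation. The sketch only shows how the loss is assembled: emotion embeddings are swapped between two clips, and the decoded results are compared with reference clips that carry the swapped content and emotion pairing.

```python
from typing import List

Vec = List[float]

# Hypothetical layout: the first CONTENT_DIM entries hold content,
# the remaining entries hold emotion. A real model would learn this split.
CONTENT_DIM = 3


def encode_content(x: Vec) -> Vec:
    """Stand-in for a learned content encoder."""
    return x[:CONTENT_DIM]


def encode_emotion(x: Vec) -> Vec:
    """Stand-in for a learned emotion encoder."""
    return x[CONTENT_DIM:]


def decode(content: Vec, emotion: Vec) -> Vec:
    """Stand-in for a learned decoder that recombines both embeddings."""
    return content + emotion


def mse(a: Vec, b: Vec) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cross_reconstruction_loss(x_c1e1: Vec, x_c2e2: Vec,
                              x_c1e2: Vec, x_c2e1: Vec) -> float:
    """Swap emotion embeddings between two clips and compare the decoded
    results with reference clips that have the swapped pairing."""
    c1, e1 = encode_content(x_c1e1), encode_emotion(x_c1e1)
    c2, e2 = encode_content(x_c2e2), encode_emotion(x_c2e2)
    rec_c1e2 = decode(c1, e2)  # content of clip 1, emotion of clip 2
    rec_c2e1 = decode(c2, e1)  # content of clip 2, emotion of clip 1
    return mse(rec_c1e2, x_c1e2) + mse(rec_c2e1, x_c2e1)


if __name__ == "__main__":
    c1, c2 = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    e1, e2 = [0.5, -0.5], [-1.0, 1.0]
    print(cross_reconstruction_loss(c1 + e1, c2 + e2, c1 + e2, c2 + e1))
```

In this toy setting the loss vanishes when the two embeddings are perfectly separated, and it grows when a reconstruction carries the wrong emotion. In a trained model the same quantity would act as a training signal that pushes the encoders toward that separation.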