Emotional Speech-Driven Animation with Content-Emotion Disentanglement

To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
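The decoupled supervision described above can be illustrated with a minimal sketch. This is not the EMOTE implementation: the feature extractors are toy stand-ins for the lip-reading and emotion networks, frames are plain lists of floats rather than rendered images, and every name here (`lip_reading_loss`, `emotion_loss`, `exchange_loss`, the `toy_*` helpers, the weights `w_lip` and `w_emo`) is hypothetical. It only shows the structure: a lip loss averaged per frame, an emotion loss computed once per sequence, and an exchange step that takes the content target from one sequence and the emotion target from another.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Frames = Sequence[Vector]  # one vector per animation frame (toy stand-in)


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (close to 0 for parallel vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def lip_reading_loss(pred: Frames, target: Frames,
                     lip_features: Callable[[Vector], Vector]) -> float:
    """Per-frame loss: compare lip features frame by frame, then average."""
    per_frame = [cosine_distance(lip_features(p), lip_features(t))
                 for p, t in zip(pred, target)]
    return sum(per_frame) / len(per_frame)


def emotion_loss(pred: Frames, target: Frames,
                 emotion_features: Callable[[Frames], Vector]) -> float:
    """Sequence-level loss: one emotion feature for the whole sequence."""
    return cosine_distance(emotion_features(pred), emotion_features(target))


def exchange_loss(pred: Frames, content_source: Frames, emotion_source: Frames,
                  lip_features, emotion_features,
                  w_lip: float = 1.0, w_emo: float = 1.0) -> float:
    """Content-emotion exchange: lips are supervised against one sequence,
    emotion against another. The weights are assumed, not from the paper."""
    return (w_lip * lip_reading_loss(pred, content_source, lip_features)
            + w_emo * emotion_loss(pred, emotion_source, emotion_features))


# Toy extractors (assumption): dims 0-1 are "mouth", dims 2-3 are "upper face".
def toy_lip_features(frame: Vector) -> Vector:
    return frame[:2]


def toy_emotion_features(frames: Frames) -> Vector:
    # Mean-pool the upper-face dims over time as a sequence-level feature.
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in (2, 3)]


seq_a = [[0.9, 0.1, 0.2, 0.1], [0.1, 0.8, 0.2, 0.1]]  # content A, emotion A
seq_b = [[0.5, 0.5, 0.7, 0.9], [0.4, 0.6, 0.7, 0.9]]  # content B, emotion B

# A prediction that speaks A's content with B's emotion.
exchanged = [a[:2] + b[2:] for a, b in zip(seq_a, seq_b)]

loss_good = exchange_loss(exchanged, seq_a, seq_b,
                          toy_lip_features, toy_emotion_features)
loss_bad = exchange_loss(seq_b, seq_a, seq_b,
                         toy_lip_features, toy_emotion_features)
```

In this toy setup, `loss_good` is near zero because the prediction matches A's mouth motion and B's upper-face expression, while `loss_bad` is larger because the lips follow the wrong speech content. Mean pooling is only one possible way to obtain a sequence-level feature; the abstract does not specify the aggregation used.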
