Emotional Speech-Driven Animation with Content-Emotion Disentanglement

To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Instead, they focus on modeling the correlations between speech and facial motion, resulting in animations that are unemotional or do not match the input emotion. We observe that two factors contribute to facial animation: the speech and the emotion. We exploit this insight in EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip sync while enabling explicit control over the expression of emotion. Due to the absence of high-quality, aligned emotional 3D face datasets with speech, EMOTE is trained on an emotional video dataset (i.e., MEAD). To achieve this, we match speech content between generated sequences and target videos differently from emotion content. Specifically, we train EMOTE with additional supervision in the form of a lip-reading objective to preserve the speech-dependent content (spatially local and high temporal frequency), while applying emotion supervision at the sequence level (spatially global and low frequency). Furthermore, we employ a content-emotion exchange mechanism to supervise different emotions on the same audio while keeping the lip motion synchronized with the speech. To employ deep perceptual losses without introducing undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Extensive qualitative, quantitative, and perceptual evaluations demonstrate that EMOTE produces state-of-the-art speech-driven facial animations, with lip sync on par with the best methods while offering additional, high-quality emotional control.
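
The split between frame-level content supervision and sequence-level emotion supervision, combined with the content-emotion exchange, can be illustrated with a toy sketch. This is not the EMOTE implementation: the feature vectors stand in for the outputs of a lip-reading network and an emotion recognition network, mean pooling over time is an assumed stand-in for sequence-level aggregation, and `generate`, the clip dictionary keys, and the loss weights are all hypothetical names introduced for illustration.

```python
from statistics import fmean


def frame_level_content_loss(pred_lip, target_lip):
    """Per-frame squared error between lip features, averaged over frames.

    Comparing frame by frame keeps the loss sensitive to fast,
    local lip motion (high temporal frequency).
    """
    per_frame = [
        fmean((p - t) ** 2 for p, t in zip(pred_frame, target_frame))
        for pred_frame, target_frame in zip(pred_lip, target_lip)
    ]
    return fmean(per_frame)


def sequence_level_emotion_loss(pred_emo, target_emo):
    """Squared error between time-averaged emotion features.

    Assumption: mean pooling over time yields one vector per sequence,
    so only the overall (low-frequency) emotion is compared.
    """
    def pool(sequence):
        return [fmean(column) for column in zip(*sequence)]

    pooled_pred, pooled_target = pool(pred_emo), pool(target_emo)
    return fmean((p - t) ** 2 for p, t in zip(pooled_pred, pooled_target))


def exchanged_loss(generate, clip_a, clip_b, w_content=1.0, w_emotion=1.0):
    """Content-emotion exchange for one pair of clips (hypothetical setup).

    The generator receives the audio of clip A and the emotion condition
    of clip B. Lip content is supervised against clip A (same audio),
    emotion is supervised against clip B (same emotion).
    """
    out = generate(clip_a["audio"], clip_b["emotion"])
    content = frame_level_content_loss(out["lip"], clip_a["lip"])
    emotion = sequence_level_emotion_loss(out["emo"], clip_b["emo"])
    return w_content * content + w_emotion * emotion
```

In this sketch the emotion term is unaffected by reordering frames, while the content term penalizes any per-frame mismatch. That mirrors the abstract's distinction between low-frequency emotion and high-frequency speech content, although the networks and weighting used in practice would differ.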
