The generation of talking avatars has achieved significant advancements in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face two fundamental challenges: a) the absence of a framework for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b) the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models. Furthermore, to enhance the flexibility of emotion control, we propose an emotion-to-latents module that leverages multimodal inputs, aligning diverse control signals (such as audio, text, and labels) to enable more varied control inputs as well as emotion control from audio alone. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. The dataset will be publicly released.
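To illustrate the idea of decoupling basic emotions into separate experts and blending them into compound states, the following is a minimal, hypothetical sketch of a mixture-of-emotion-experts layer. All names, dimensions, and the softmax gating are illustrative assumptions for exposition only, not the paper's actual MoEE implementation.

```python
# Hypothetical sketch of gating over six emotion "experts" to form a blended
# emotion latent. Architecture details here are assumptions, not the paper's method.
import torch
import torch.nn as nn

EMOTIONS = ["happy", "sad", "angry", "fearful", "disgusted", "surprised"]

class EmotionExpertMixture(nn.Module):
    def __init__(self, cond_dim: int = 256, latent_dim: int = 512):
        super().__init__()
        # One small expert network per basic emotion.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(cond_dim, latent_dim), nn.GELU(),
                          nn.Linear(latent_dim, latent_dim))
            for _ in EMOTIONS
        )
        # Gating network predicts per-expert weights from a control embedding
        # (e.g. derived from audio, text, or an emotion label).
        self.gate = nn.Linear(cond_dim, len(EMOTIONS))

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (batch, cond_dim) multimodal control embedding.
        weights = torch.softmax(self.gate(cond), dim=-1)                   # (B, 6)
        expert_out = torch.stack([e(cond) for e in self.experts], dim=1)   # (B, 6, D)
        # Weighted sum blends basic emotions into one (possibly compound) latent.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)             # (B, D)

if __name__ == "__main__":
    model = EmotionExpertMixture()
    emotion_latent = model(torch.randn(2, 256))
    print(emotion_latent.shape)  # torch.Size([2, 512])
```

In this sketch, a compound emotion such as "happily surprised" would correspond to a gating distribution with most of its mass on the happy and surprised experts; the actual decoupling and fusion mechanism in MoEE may differ.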