Most current audio-driven facial animation research focuses on generating videos with neutral emotions. While some studies have addressed facial video generation driven by emotional audio, efficiently producing high-quality talking-head videos that integrate both emotional expressions and style features remains a significant challenge. In this paper, we propose ESGaussianFace, a novel framework for emotional and stylized audio-driven facial animation. Our approach leverages 3D Gaussian Splatting to reconstruct 3D scenes and render videos, ensuring the efficient generation of 3D-consistent results. We propose an emotion-audio-guided spatial attention method that effectively integrates emotion features with audio content features; through this emotion-guided attention, the model reconstructs facial details across different emotional states more accurately. To achieve emotional and stylized deformations of the 3D Gaussian points driven by emotion and style features, we introduce two 3D Gaussian deformation predictors. Furthermore, we propose a multi-stage training strategy that enables step-by-step learning of the character's lip movements, emotional variations, and style features. Our generated results exhibit high efficiency, high quality, and 3D consistency. Extensive experiments demonstrate that our method outperforms existing state-of-the-art techniques in lip-movement accuracy, expression variation, and style-feature expressiveness.
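The abstract describes an emotion-audio-guided attention mechanism but gives no implementation details. The following is a minimal illustrative sketch of one plausible form: a cross-attention step in which an emotion embedding acts as the query over per-frame audio content features. All shapes, projection matrices, and the fusion scheme here are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def emotion_guided_attention(audio_feats, emotion_emb, W_q, W_k, W_v):
    """Hypothetical emotion-guided cross-attention.

    audio_feats: (T, D) per-frame audio content features
    emotion_emb: (D,)   emotion embedding (acts as the query)
    W_q, W_k, W_v: (D, d) projection matrices (assumed learnable)

    Returns the fused (d,) feature and the (T,) attention weights.
    """
    d = W_q.shape[1]
    q = emotion_emb @ W_q          # (d,)  emotion-conditioned query
    k = audio_feats @ W_k          # (T, d) keys from audio frames
    v = audio_feats @ W_v          # (T, d) values from audio frames
    scores = (k @ q) / np.sqrt(d)  # (T,)  scaled dot-product scores
    weights = softmax(scores)      # (T,)  attention over audio frames
    fused = weights @ v            # (d,)  emotion-weighted audio feature
    return fused, weights

# Example usage with random features (shapes are illustrative).
rng = np.random.default_rng(0)
T, D, d = 8, 16, 32
audio = rng.standard_normal((T, D))
emotion = rng.standard_normal(D)
W_q, W_k, W_v = (rng.standard_normal((D, d)) for _ in range(3))
fused, weights = emotion_guided_attention(audio, emotion, W_q, W_k, W_v)
```

In such a scheme, the emotion embedding determines which audio frames contribute most to the fused feature, which is one way "integrating emotion features with audio content features" could be realized; the fused feature would then condition downstream modules such as the deformation predictors.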