Fine-grained emotion control is crucial for emotion generation tasks because it strengthens the expressive capability of the generative model, allowing it to capture and express nuanced emotional states accurately and comprehensively, which in turn improves the emotional quality and personalization of the generated content. Generating fine-grained facial animations that accurately portray emotional expressions from only a portrait and an audio recording remains challenging. To address this challenge, we propose a visual attribute-guided audio decoupler, which extracts content vectors related solely to the audio content and thereby stabilizes the subsequent prediction of lip movement coefficients. To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module. We additionally propose an emotion intensity control method based on a fine-grained emotion matrix. Together, these components provide effective control over emotional expression in the generated videos and a finer classification of emotion intensity. A series of 3DMM coefficient generation networks then predicts the 3D coefficients, and a rendering network generates the final video. Experimental results show that our method, EmoSpeaker, outperforms existing emotional talking face generation methods in expression variation and lip synchronization. Project page: https://peterfanfan.github.io/EmoSpeaker/