This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety and thus fail to align with human expectations. To overcome these limitations, we introduce an STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process, encouraging the STA model to explore a broader range of possibilities and produce diverse, emotionally expressive facial animations of superior quality. Extensive experiments on a benchmark dataset validate the effectiveness of the proposed framework in generating high-quality, emotionally rich 3D animations that better align with human preferences.