Although previous co-speech gesture generation methods can synthesize motions that match the speech content, they still struggle to model diverse and complicated motion distributions. The key challenges are: 1) the one-to-many mapping between speech content and gestures; 2) modeling the correlations among body joints. In this paper, we present a novel framework (EMoG) that tackles these challenges with denoising diffusion models: 1) to alleviate the one-to-many problem, we incorporate emotion cues to guide the generation process, which makes generation considerably easier; 2) to model joint correlation, we decompose the difficult gesture generation problem into two sub-problems: joint correlation modeling and temporal dynamics modeling. Both sub-problems are then explicitly handled by our proposed Joint Correlation-aware transFormer (JCFormer). Extensive evaluations demonstrate that our method surpasses previous state-of-the-art approaches in gesture synthesis.
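The decomposition into joint correlation modeling and temporal dynamics modeling can be illustrated with a minimal NumPy sketch. This is not the JCFormer architecture, whose details the abstract does not give. It only shows one plausible way to factorize attention over a motion tensor: first across the joints within each frame, then across the frames for each joint. The tensor layout `(T, J, D)`, the identity query/key/value projections, the residual connections, and the names `self_attention` and `factorized_block` are all assumptions made for illustration. Emotion conditioning and the diffusion process are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, D). For brevity the query, key, and value
    projections are the identity (an assumption; a real model learns them).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ x, weights

def factorized_block(motion):
    """Hypothetical factorized block over motion of shape (T, J, D)."""
    # Joint correlation: within each frame, every joint attends to all joints.
    joint_out, joint_w = self_attention(motion)       # weights: (T, J, J)
    h = motion + joint_out                            # residual connection

    # Temporal dynamics: for each joint, every frame attends to all frames.
    h_t = np.swapaxes(h, 0, 1)                        # (J, T, D)
    temp_out, temp_w = self_attention(h_t)            # weights: (J, T, T)
    out = h + np.swapaxes(temp_out, 0, 1)             # back to (T, J, D)
    return out, joint_w, temp_w

rng = np.random.default_rng(0)
motion = rng.standard_normal((8, 5, 4))               # 8 frames, 5 joints, 4 features
out, joint_w, temp_w = factorized_block(motion)
print(out.shape, joint_w.shape, temp_w.shape)
```

Splitting the attention this way keeps each attention matrix small (`J x J` per frame and `T x T` per joint) instead of one `(T*J) x (T*J)` matrix over all joint-frame pairs, which is one common motivation for treating the two sub-problems separately.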