Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While existing methods can generate gestures that follow a single emotion label, they overlook that modeling long gesture sequences with emotion transitions is more practical in real scenes. In addition, the lack of large-scale datasets with emotion transition speech and corresponding 3D human gestures limits progress on this task. To address this, we first combine ChatGPT-4 with an audio inpainting approach to construct high-fidelity emotion transition human speech. Because obtaining realistic 3D pose annotations for the dynamically inpainted emotion transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authentic gesture transitions. Specifically, to enhance the coordination of transition gestures with respect to the different emotional gestures, we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision for transition gestures based on a learnable mixed emotion label. Finally, we present a keyframe sampler that supplies effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets.
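
As a rough illustration of what weak supervision from a mixed emotion label could look like, the sketch below builds a soft label between two emotions and scores a predicted emotion distribution against it. This is not the paper's implementation: the emotion set, the sigmoid-parameterized mixing weight `mix_logit` (standing in for a learnable parameter), and the soft cross-entropy loss are all assumptions made for illustration.

```python
import math

# Hypothetical emotion vocabulary; the abstract does not list the label set.
EMOTIONS = ["neutral", "happy", "angry", "sad"]


def one_hot(emotion):
    """Return a one-hot vector for a single emotion label."""
    return [1.0 if e == emotion else 0.0 for e in EMOTIONS]


def mixed_emotion_label(src, dst, mix_logit):
    """Blend two one-hot labels into a soft label for a transition segment.

    mix_logit is an assumed stand-in for a learnable scalar; the sigmoid
    keeps the mixing weight strictly between 0 and 1.
    """
    w = 1.0 / (1.0 + math.exp(-mix_logit))
    a, b = one_hot(src), one_hot(dst)
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def soft_cross_entropy(pred_probs, soft_label, eps=1e-12):
    """Cross-entropy between a predicted distribution and a soft target."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, soft_label))


# A transition from "happy" to "sad" with an even mixing weight.
label = mixed_emotion_label("happy", "sad", mix_logit=0.0)
# Loss for a (hypothetical) emotion classifier that predicts a uniform distribution.
loss = soft_cross_entropy([0.25, 0.25, 0.25, 0.25], label)
```

In a real training loop, `mix_logit` would be a trainable tensor and the predicted distribution would come from an emotion classifier applied to the generated transition gestures; both are assumptions beyond what the abstract states.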
