Co-speech gesture generation is crucial for automatic digital avatar animation. However, existing methods suffer from unstable training and temporal inconsistency, particularly when generating high-fidelity and comprehensive gestures. These methods also lack effective control over speaker identity and over temporal editing of the generated gestures. Focusing on capturing temporal latent information and enabling practical control, we propose a Controllable Co-speech Gesture Generation framework, named C2G2. Specifically, we propose a two-stage temporal dependency enhancement strategy motivated by latent diffusion models. We further introduce two key features to C2G2: a speaker-specific decoder that generates speaker-related real-length skeletons, and a repainting strategy for flexible gesture generation and editing. Extensive experiments on benchmark gesture datasets verify the effectiveness of C2G2 compared with several state-of-the-art baselines. The project demo page is available at https://c2g2-gesture.github.io/c2_gesture