Co-speech gestures, when presented as lively video, can achieve superior visual effects in human-machine interaction. Previous works mostly generate structural human skeletons, which omits appearance information; in this work, we instead focus on directly generating audio-driven co-speech gesture videos. There are two main challenges: 1) a suitable motion feature is needed to describe complex human movements while retaining crucial appearance information; 2) gestures and speech exhibit inherent dependencies and should be temporally aligned even for sequences of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a carefully designed nonlinear thin-plate spline (TPS) transformation to obtain latent motion features that preserve essential appearance information. We then propose a transformer-based diffusion model that learns the temporal correlation between gestures and speech and performs generation in the latent motion space, followed by an optimal motion selection module that produces long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network that focuses on missing details in certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.