Gestures are essential to co-speech communication, providing visual emphasis and complementing verbal interaction. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent-representation deviations tailored to co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate the hand gestures and mouth movements that are crucial for realistic video production. Experiments show that our method significantly improves video quality and surpasses current state-of-the-art techniques.
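To make the core idea concrete, the following is a minimal, purely illustrative sketch (not the paper's actual model or code) of generating a latent *deviation* with an iterative diffusion-style denoising loop, conditioned on audio features, and adding it to a reference-frame latent. All names (`denoise_step`, `audio_feat`, the 64-dimensional latent size, the toy denoiser itself) are hypothetical stand-ins; a real system would use a trained neural denoiser and a decoder network.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x_t, t, audio_feat, T=50):
    """Toy stand-in for a learned denoiser: shrinks the noisy deviation
    toward an audio-conditioned mean. A real model would be a trained
    network predicting noise or the clean signal at each step."""
    alpha = 1.0 - t / T  # toy schedule: trust the current sample more as t -> 0
    return alpha * x_t + (1 - alpha) * audio_feat

# Hypothetical inputs: latent of the reference frame and audio conditioning.
z_ref = rng.standard_normal(64)
audio_feat = rng.standard_normal(64) * 0.1

# Sample a latent deviation by iterative denoising from pure noise.
T = 50
delta_z = rng.standard_normal(64)
for t in range(T, 0, -1):
    delta_z = denoise_step(delta_z, t, audio_feat, T)

# The deviation is applied in latent space; a decoder (not shown) would
# map z_out to the output video frame with the generated gesture.
z_out = z_ref + delta_z
print(z_out.shape)  # (64,)
```

The key design point the abstract suggests is that the model predicts a *deviation* from a reference latent rather than a full latent, which keeps appearance anchored to the reference while the diffusion process supplies motion.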