Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Co-speech gestures, when presented as video, can achieve superior visual effects in human-machine interaction. Previous works mostly generate structural human skeletons, which omits appearance information; in this work we focus instead on directly generating audio-driven co-speech gesture videos. There are two main challenges: 1) a suitable motion feature is needed to describe complex human movements while retaining crucial appearance information; 2) gestures and speech exhibit inherent dependencies and should remain temporally aligned for sequences of arbitrary length. To address these problems, we present a novel motion-decoupled framework for generating co-speech gesture videos. Specifically, we first introduce a carefully designed nonlinear thin-plate spline (TPS) transformation to obtain latent motion features that preserve essential appearance information. A transformer-based diffusion model then learns the temporal correlation between gestures and speech and performs generation in the latent motion space, followed by an optimal motion selection module that produces long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network that focuses on missing details in certain areas. Extensive experimental results show that the proposed framework significantly outperforms existing approaches in both motion-related and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
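
To make the TPS transformation mentioned in the abstract more concrete, the following is a minimal sketch of a generic 2D thin-plate spline fit between two sets of keypoints. It is the textbook formulation, not the authors' implementation; the function names (tps_kernel, fit_tps, apply_tps) and the sample keypoints are assumptions introduced here for illustration only.

```python
import numpy as np


def tps_kernel(r2):
    """TPS radial basis U(r) = r^2 * log(r), taking squared distances r2.

    U(0) is defined as 0; the inner np.where avoids evaluating log(0).
    """
    safe = np.where(r2 > 0, r2, 1.0)
    return np.where(r2 > 0, 0.5 * r2 * np.log(safe), 0.0)


def fit_tps(src, dst):
    """Fit a 2D TPS that maps control points src (n x 2) onto dst (n x 2).

    Returns (w, a): non-affine weights (n x 2) and affine part (3 x 2).
    Assumes the control points are distinct and not all collinear.
    """
    n = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    K = tps_kernel(d2)
    P = np.hstack([np.ones((n, 1)), src])  # homogeneous coordinates, n x 3

    # Standard TPS linear system: [[K, P], [P^T, 0]] @ [w; a] = [dst; 0]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst

    params = np.linalg.solve(L, rhs)
    return params[:n], params[n:]


def apply_tps(points, src, w, a):
    """Warp arbitrary points (m x 2) with a TPS fitted on control points src."""
    d2 = np.sum((points[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return tps_kernel(d2) @ w + P @ a


# Hypothetical keypoints: four corners and the centre of a unit square.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
# Same keypoints after a small, made-up displacement.
dst = src + np.array([[0.0, 0.0], [0.05, 0.0], [0.0, -0.05], [0.0, 0.0], [0.1, 0.1]])

w, a = fit_tps(src, dst)
warped = apply_tps(src, src, w, a)
```

In this formulation the fitted warp interpolates the control points exactly, so warped matches dst up to numerical precision, and the non-affine weights w carry the nonlinear part of the deformation. How the paper turns such transformations into latent motion features is not described in the abstract and is not modelled here.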

