Speech-driven gesture generation using transformer-based generative models is a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexity, which limits scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model built on the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on Mamba-2. The fuzzy feature extractor, which integrates a Chinese pre-trained model with Mamba-2, autonomously extracts implicit, continuous speech features. These features are fused into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism that applies condition-dependent transformations uniformly across all sequence tokens, enabling precise modeling of the nuanced interplay between speech features and gesture dynamics. We employ a diffusion model for training and inference, yielding diverse gesture outputs. Extensive subjective and objective evaluations on the newly released Chinese Co-Speech Gestures (CCG) dataset corroborate the efficacy of the proposed model. Compared with Transformer-based architectures, our approach delivers competitive results while reducing memory usage by approximately 2.4 times and accelerating inference by 2 to 4 times. Additionally, we release the CCG dataset, comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.
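The AdaLN mechanism described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: the function name `ada_layer_norm`, the projection weights `w_scale`/`w_shift`, and the shapes are all hypothetical. The key idea it shows is that each token is layer-normalized, then rescaled and shifted by vectors regressed from a shared conditioning signal, so the same condition-dependent transformation is applied uniformly to every sequence token.

```python
import numpy as np

def ada_layer_norm(x, cond, w_scale, w_shift, eps=1e-5):
    """Adaptive LayerNorm sketch (hypothetical shapes/names).

    x:       (seq_len, d)  token features
    cond:    (d_c,)        conditioning vector (e.g., a speech embedding)
    w_scale: (d_c, d)      projection producing the per-feature scale
    w_shift: (d_c, d)      projection producing the per-feature shift
    """
    # Standard per-token layer normalization over the feature axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    # Scale and shift depend only on the condition, so they are
    # identical for every token in the sequence
    gamma = cond @ w_scale  # (d,)
    beta = cond @ w_shift   # (d,)
    return x_norm * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
seq_len, d, d_c = 8, 16, 4
x = rng.normal(size=(seq_len, d))
cond = rng.normal(size=(d_c,))
w_scale = rng.normal(size=(d_c, d)) * 0.1
w_shift = rng.normal(size=(d_c, d)) * 0.1
y = ada_layer_norm(x, cond, w_scale, w_shift)
print(y.shape)  # (8, 16)
```

With zero projection weights the function reduces to plain LayerNorm, which makes the role of the conditioning signal easy to inspect in isolation.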