Speech-driven gesture generation using transformer-based generative models is a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexity, which limits scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model built on the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on Mamba-2. The fuzzy feature extractor, which integrates a Chinese pre-trained model with Mamba-2, autonomously extracts implicit, continuous speech features. These features are fused into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism that applies condition-dependent transformations uniformly across all sequence tokens, enabling precise modeling of the nuanced interplay between speech features and gesture dynamics. We employ a diffusion model for training and inference, yielding diverse gesture outputs. Extensive subjective and objective evaluations on the newly released Chinese Co-Speech Gestures (CCG) dataset corroborate the efficacy of the proposed model. Compared with Transformer-based architectures, our approach delivers competitive results while reducing memory usage by approximately 2.4 times and accelerating inference by 2 to 4 times. Additionally, we release the CCG dataset, comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.
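The AdaLN mechanism described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: the function name `ada_layer_norm`, the projection weights `w_scale`/`w_shift`, and the shapes are all hypothetical. The key idea it shows is that each token is layer-normalized, then rescaled and shifted by vectors regressed from a shared conditioning signal, so the same condition-dependent transformation is applied uniformly to every sequence token.

```python
import numpy as np

def ada_layer_norm(x, cond, w_scale, w_shift, eps=1e-5):
    """Adaptive LayerNorm sketch (hypothetical shapes/names).

    x:       (seq_len, d)  token features
    cond:    (d_c,)        conditioning vector (e.g., a speech embedding)
    w_scale: (d_c, d)      projection producing the per-feature scale
    w_shift: (d_c, d)      projection producing the per-feature shift
    """
    # Standard per-token layer normalization over the feature axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    # Scale and shift depend only on the condition, so they are
    # identical for every token in the sequence
    gamma = cond @ w_scale  # (d,)
    beta = cond @ w_shift   # (d,)
    return x_norm * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
seq_len, d, d_c = 8, 16, 4
x = rng.normal(size=(seq_len, d))
cond = rng.normal(size=(d_c,))
w_scale = rng.normal(size=(d_c, d)) * 0.1
w_shift = rng.normal(size=(d_c, d)) * 0.1
y = ada_layer_norm(x, cond, w_scale, w_shift)
print(y.shape)  # (8, 16)
```

With zero projection weights the function reduces to plain LayerNorm, which makes the role of the conditioning signal easy to inspect in isolation.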