Speech-driven gesture generation is an emerging area of virtual human creation. Current methods predominantly rely on Transformer-based architectures, which require substantial memory and suffer from slow inference. To address these limitations, we propose \textit{DiM-Gestures}, a novel end-to-end generative model that creates highly personalized 3D full-body gestures solely from raw speech audio using Mamba-based architectures. The model combines a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, built on a Mamba framework and a pre-trained WavLM model, automatically derives implicit, continuous fuzzy features and merges them into a single latent feature. This feature is processed by the AdaLN Mamba-2, which applies a uniform conditional mechanism across all tokens to robustly model the interplay between the fuzzy features and the resulting gesture sequence. This approach yields high-fidelity gesture-speech synchronization while preserving the naturalness of the gestures. The framework uses a diffusion model for both training and inference, and we evaluate it extensively, both subjectively and objectively, on the ZEGGS and BEAT datasets. These evaluations show that our model improves on contemporary state-of-the-art methods and achieves results competitive with the DiTs architecture (Persona-Gestors) while reducing memory usage and accelerating inference.
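As a rough illustration of the uniform conditioning idea behind AdaLN, the sketch below normalizes each token and then applies the same shift and scale, regressed from one condition vector, to every token in the sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`layer_norm`, `linear`, `adaln_modulate`), the shift-then-scale parameter layout, and the toy dimensions are all hypothetical, and the Mamba-2 sequence mixer and the diffusion timestep are omitted entirely.

```python
import math

def layer_norm(vec, eps=1e-5):
    # Normalize one token vector to zero mean and unit variance (no learned affine).
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear(cond, weight, bias):
    # Plain affine map; weight holds one row per output value (hypothetical layout).
    return [sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(weight, bias)]

def adaln_modulate(tokens, cond, weight, bias):
    # Regress shift and scale from the single condition vector, then apply the
    # SAME modulation to every token: the "uniform conditional mechanism".
    dim = len(tokens[0])
    params = linear(cond, weight, bias)      # length 2 * dim
    shift, scale = params[:dim], params[dim:]
    return [
        [n * (1.0 + s) + sh for n, s, sh in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]

# Toy example: 2 gesture tokens of width 4, conditioned on a 2-dim latent feature.
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5]]
cond = [0.2, -0.1]
weight = [[0.1 * (i + 1), -0.05 * (i + 1)] for i in range(8)]  # 8 = 2 * dim rows
bias = [0.0] * 8
modulated = adaln_modulate(tokens, cond, weight, bias)
```

Because the shift and scale depend only on the condition vector, the cost of conditioning does not grow with sequence length, which is one plausible reason such a mechanism pairs well with a linear-time sequence model.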