Gesture synthesis is an important area of human-computer interaction, with applications in film, robotics, and virtual reality. Recent work has used diffusion models and attention mechanisms to improve gesture synthesis. However, because of the high computational complexity of these techniques, generating long and diverse sequences with low latency remains a challenge. We explore the potential of state space models (SSMs) to address this challenge, adopting a two-stage modeling strategy with discrete motion priors to enhance gesture quality. Building on the Mamba block, we introduce MambaTalk, which enhances gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models.