Gesture synthesis is an important area of human-computer interaction, with wide-ranging applications in fields such as film, robotics, and virtual reality. Recent advances have used diffusion models and attention mechanisms to improve gesture synthesis. However, due to the high computational complexity of these techniques, generating long and diverse sequences with low latency remains a challenge. We explore the potential of state space models (SSMs) to address this challenge, implementing a two-stage modeling strategy with discrete motion priors to enhance gesture quality. Building on the foundational Mamba block, we introduce MambaTalk, which enhances gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models.
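The low-latency appeal of SSMs comes from their recurrent form: each output step costs a fixed amount of work regardless of sequence length, unlike attention's quadratic cost. The sketch below is a minimal, generic discretized linear SSM recurrence (h_t = A h_{t-1} + B x_t, y_t = C h_t) for illustration only; it is not the MambaTalk architecture, and the matrices and the `ssm_scan` helper are hypothetical.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Run a discretized linear state space recurrence over a 1-D input sequence.

    h_t = A @ h_{t-1} + B @ x_t
    y_t = C @ h_t

    Each step touches only the fixed-size hidden state, so the cost per
    output is constant in sequence length (linear overall).
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        h = A @ h + B @ np.atleast_1d(x_t)  # update hidden state
        ys.append(C @ h)                     # read out the output
    return np.array(ys)

# Toy example: scalar input, 2-D hidden state.
A = np.array([[0.9, 0.0], [0.1, 0.8]])   # state-transition matrix
B = np.array([[1.0], [0.5]])             # input projection
C = np.array([[1.0, -1.0]])              # output projection
x = np.ones(5)                           # constant input sequence
y = ssm_scan(A, B, C, x)
```

Selective SSMs such as Mamba extend this recurrence by making the parameters input-dependent, which is what lets the model filter and gate motion information along the sequence.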