Gesture synthesis is a vital area of human-computer interaction, with wide-ranging applications in fields such as film, robotics, and virtual reality. Recent advances have leveraged diffusion models and attention mechanisms to improve gesture synthesis. However, the high computational complexity of these techniques makes generating long, diverse sequences with low latency a challenge. We explore the potential of state space models (SSMs) to address this challenge, implementing a two-stage modeling strategy with discrete motion priors to enhance gesture quality. Building on the foundational Mamba block, we introduce MambaTalk, which enhances gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models.