DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control

Text-conditioned human motion generation, which allows for user interaction through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous and can extend over long periods, carrying rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in an online and real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. Our model, DART, effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. By autoregressively generating motion primitives based on the preceding history and current text input, DART enables real-time, sequential motion generation driven by natural language descriptions. Additionally, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks. Experiments show our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/.
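
The autoregressive rollout described above can be illustrated with a minimal sketch. It is not the authors' implementation: `embed_text` and `sample_primitive` are hypothetical stand-ins for a text encoder and for the history- and text-conditioned latent diffusion sampler, and the sizes `H`, `F`, and `D` are assumed values chosen only for illustration. The sketch shows only the control flow: each primitive is generated from the last few frames and the current prompt, and the end of that primitive becomes the history for the next one.

```python
import numpy as np

# Assumed sizes: history frames, future frames per primitive, pose feature dimension.
H, F, D = 2, 8, 4


def embed_text(prompt: str) -> np.ndarray:
    """Hypothetical text encoder: a deterministic pseudo-embedding of the prompt."""
    seed = sum(ord(c) for c in prompt)
    return np.random.default_rng(seed).standard_normal(D)


def sample_primitive(history: np.ndarray, text_emb: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the conditioned sampler.

    Returns F future frames that continue from the last history frame.
    A real model would run denoising steps in a latent space and decode the result.
    """
    start = history[-1]                          # shape (D,)
    steps = np.arange(1, F + 1)[:, None] / F     # shape (F, 1)
    return start + steps * (0.1 * text_emb) + 0.01 * noise


def rollout(prompts, init_history: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Generate one primitive per prompt, feeding each primitive's tail back as history."""
    history = init_history
    frames = [init_history]
    for prompt in prompts:
        noise = rng.standard_normal((F, D))
        future = sample_primitive(history, embed_text(prompt), noise)
        frames.append(future)
        history = future[-H:]                    # last H frames condition the next step
    return np.concatenate(frames, axis=0)


motion = rollout(["walk", "walk", "sit down"], np.zeros((H, D)), np.random.default_rng(0))
print(motion.shape)  # (26, 4): H initial frames plus 3 primitives of F frames each
```

Because each step depends only on a short history and the current prompt, the prompt can change between primitives, which is what allows a stream of text to drive generation online. The sampled `noise` is also the quantity that a latent noise optimization would adjust to satisfy spatial constraints.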
