DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control

Text-conditioned human motion generation, which allows for user interaction through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous and can extend over long periods, carrying rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in an online and real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DartControl, in short DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. Our model effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. By autoregressively generating motion primitives based on the preceding history and current text input, DART enables real-time, sequential motion generation driven by natural language descriptions. Additionally, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks. Experiments show our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/.
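The autoregressive loop described above can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation: the window sizes `HISTORY_LEN` and `FUTURE_LEN`, the pose dimension `POSE_DIM`, and the helpers `toy_text_embedding` and `toy_sample_primitive` are hypothetical stand-ins for the text encoder and the latent diffusion sampler, chosen only to show how each new primitive is conditioned on the most recent frames and the current text prompt.

```python
import random
from typing import List, Sequence

Frame = List[float]        # one pose, here a toy feature vector
Primitive = List[Frame]    # a short window of future frames

# All three sizes are assumptions for illustration, not values from the paper.
HISTORY_LEN = 2   # frames of history each primitive is conditioned on
FUTURE_LEN = 8    # frames generated per primitive
POSE_DIM = 3      # toy pose dimensionality


def toy_text_embedding(text: str) -> float:
    """Hypothetical stand-in for a text encoder: prompt -> per-frame offset."""
    return {"walk": 0.1, "run": 0.3}.get(text, 0.0)


def toy_sample_primitive(history: Sequence[Frame], text: str,
                         rng: random.Random, steps: int = 10) -> Primitive:
    """Hypothetical stand-in for the diffusion sampler and decoder.

    Starts from Gaussian noise and repeatedly pulls it toward a target set
    by the text. This is a toy update rule, not a trained denoiser.
    """
    target = toy_text_embedding(text)
    latent = [rng.gauss(0.0, 1.0) for _ in range(POSE_DIM)]
    for _ in range(steps):
        latent = [z + 0.5 * (target - z) for z in latent]

    # "Decode" the latent as a constant per-frame offset that continues
    # from the last history frame, so consecutive primitives stay connected.
    last = list(history[-1])
    primitive: Primitive = []
    for _ in range(FUTURE_LEN):
        last = [p + z for p, z in zip(last, latent)]
        primitive.append(last)
    return primitive


def rollout(prompts: Sequence[str], seed: int = 0) -> List[Frame]:
    """Generate one primitive per prompt, each conditioned on recent frames."""
    rng = random.Random(seed)
    motion: List[Frame] = [[0.0] * POSE_DIM for _ in range(HISTORY_LEN)]
    for text in prompts:
        history = motion[-HISTORY_LEN:]   # only the latest frames are used
        motion.extend(toy_sample_primitive(history, text, rng))
    return motion
```

Calling `rollout(["walk", "walk", "run"])` returns the seed frames followed by three primitives. Because every step reads only the last `HISTORY_LEN` frames and the current prompt, the prompt can change between primitives, which is the property that makes streaming text control possible in this kind of design.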

