Human motion generation aims to produce plausible human motion sequences conditioned on various inputs, such as text or audio. Although existing methods can generate motion from short prompts and simple motion patterns, they struggle with long prompts and complex motions. The challenges are two-fold: 1) the scarcity of motion-capture data paired with long prompts and complex motions, and 2) the high temporal diversity of human motion and the substantial distribution gap across conditioning modalities, which together yield a many-to-many mapping problem when generating motion from long, complex texts. In this work, we address these gaps by 1) constructing HumanLong3D, the first dataset pairing long textual descriptions with complex 3D motions, and 2) proposing an autoregressive motion diffusion model (AMD). Specifically, AMD conditions each step on the current text prompt together with the text prompt and motion sequence from the previous step, iteratively predicting the current motion sequence. Furthermore, we generalize AMD to X-to-Motion under a "No Modality Left Behind" setting, enabling the generation of high-definition, high-fidelity human motions from user-specified modality inputs.
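To make the autoregressive conditioning concrete, below is a minimal sketch of the generation loop described above. It assumes a hypothetical diffusion sampler `denoise` and text encoder `encode_text`; the embedding size, segment length, and 263-dimensional pose features (a common HumanML3D-style choice) are illustrative assumptions, not the paper's actual interface.

```python
import torch

# Sketch of AMD-style autoregressive generation: each motion segment is
# denoised from Gaussian noise, conditioned on the current prompt plus the
# previous prompt and previously generated motion. `denoise` and
# `encode_text` are hypothetical stand-ins for the model's components.

def generate_long_motion(denoise, encode_text, prompts,
                         motion_dim=263, seg_len=60, text_dim=512):
    segments = []
    prev_text = torch.zeros(1, text_dim)              # "no previous prompt" placeholder
    prev_motion = torch.zeros(1, seg_len, motion_dim)  # "no previous motion" placeholder
    for prompt in prompts:
        cur_text = encode_text(prompt)                 # (1, text_dim) embedding
        noise = torch.randn(1, seg_len, motion_dim)
        # Condition on the current text and the previous step's text and motion.
        cur_motion = denoise(noise, cond=(cur_text, prev_text, prev_motion))
        segments.append(cur_motion)
        prev_text, prev_motion = cur_text, cur_motion
    return torch.cat(segments, dim=1)                  # (1, n_segments * seg_len, motion_dim)
```

Threading the previous segment's prompt and motion through the conditioning is what lets each new segment stay temporally coherent with what came before, rather than being sampled independently per prompt.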