Human motion generation aims to produce plausible human motion sequences from conditional inputs such as text or audio. Existing methods can generate motion from short prompts and simple motion patterns, but they struggle with long prompts and complex motions. The challenges are two-fold: 1) motion-capture data for long prompts and complex motions is scarce, and 2) human motion is highly diverse in the temporal domain and its distribution diverges substantially from those of the conditional modalities, which leads to a many-to-many mapping problem when generating motion from long, complex texts. In this work, we address these gaps by 1) constructing the first dataset that pairs long textual descriptions with complex 3D motions (HumanLong3D), and 2) proposing an autoregressive motion diffusion model (AMD). Specifically, AMD takes the text prompt at the current timestep, together with the text prompt and action sequence from the previous timestep, as conditional information and predicts the current action sequence iteratively. Furthermore, we present its generalization to X-to-Motion with "No Modality Left Behind", enabling, for the first time, the generation of high-definition and high-fidelity human motions from user-defined modality input.
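The autoregressive conditioning described above can be illustrated with a minimal Python sketch. It reads "timestep" as the index of a text/motion segment, which is an assumption, and it is not the AMD implementation: `toy_denoiser`, `generate_long_motion`, and all of their parameters are hypothetical names, and the denoiser is a placeholder for a trained network. The sketch only shows how the current prompt, the previous prompt, and the previously generated segment could be passed as conditions while each new segment is denoised from Gaussian noise.

```python
import random
from typing import Callable, List, Optional, Sequence

Frame = List[float]
Motion = List[Frame]  # one segment: frames x pose dimensions

# Hypothetical denoiser signature: (noisy segment, diffusion step, current text,
# previous text, previous motion) -> slightly cleaner segment.
Denoiser = Callable[[Motion, int, str, Optional[str], Optional[Motion]], Motion]


def toy_denoiser(noisy: Motion, step: int, cur_text: str,
                 prev_text: Optional[str], prev_motion: Optional[Motion]) -> Motion:
    """Placeholder for a trained network (assumption): it ignores the text and
    pulls every frame halfway toward the last frame of the previous segment,
    or toward zeros when there is no previous segment."""
    dim = len(noisy[0])
    target = prev_motion[-1] if prev_motion else [0.0] * dim
    return [[t + 0.5 * (v - t) for v, t in zip(frame, target)] for frame in noisy]


def generate_long_motion(prompts: Sequence[str],
                         denoise_fn: Denoiser = toy_denoiser,
                         frames: int = 4,
                         pose_dim: int = 3,
                         num_steps: int = 10,
                         seed: int = 0) -> List[Motion]:
    """Generate one motion segment per prompt, conditioning each segment on the
    current prompt plus the previous prompt and previous generated segment."""
    rng = random.Random(seed)
    segments: List[Motion] = []
    for i, cur_text in enumerate(prompts):
        # The first segment has no history, so both previous conditions are None.
        prev_text = prompts[i - 1] if i > 0 else None
        prev_motion = segments[-1] if i > 0 else None
        # Start every segment from Gaussian noise.
        x = [[rng.gauss(0.0, 1.0) for _ in range(pose_dim)] for _ in range(frames)]
        # Reverse diffusion loop: repeatedly denoise under the same conditions.
        for step in reversed(range(num_steps)):
            x = denoise_fn(x, step, cur_text, prev_text, prev_motion)
        segments.append(x)
    return segments


if __name__ == "__main__":
    motion = generate_long_motion(["walk forward", "turn left", "sit down"])
    print(len(motion), "segments of", len(motion[0]), "frames each")
```

Concatenating the returned segments gives one long sequence. With a real model, `denoise_fn` would be replaced by the learned network and the text would be encoded rather than passed as a raw string.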