The text2motion task is to generate human motion sequences from textual descriptions, which requires a model to learn diverse mappings from natural language instructions to human body movements. Most existing work is confined to coarse-grained motion descriptions, e.g., "A man squats.", while fine-grained descriptions that specify the movements of relevant body parts remain barely explored. Models trained on coarse-grained texts may fail to learn mappings from fine-grained motion-related words to motion primitives, and consequently fail to generate motions from unseen descriptions. In this paper, we build FineHumanML3D, a large-scale language-motion dataset specializing in fine-grained textual descriptions, by prompting GPT-3.5-turbo with step-by-step instructions and compulsory pseudo-code checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, that makes full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38 over competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.