The text2motion task is to generate motion sequences from textual descriptions, which requires a model to capture the interactions between natural language instructions and human body movements. Most existing work is confined to coarse-grained motion descriptions (e.g., "A man squats."), while fine-grained descriptions that specify the movements of the relevant body parts remain largely unexplored. Models trained on coarse-grained texts may fail to learn mappings from fine-grained motion-related words to motion primitives, and as a result may fail to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset with fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo carefully designed prompts. Accordingly, we design a new text2motion model, FineMotionDiffuse, which makes full use of the fine-grained textual information. Our experiments show that FineMotionDiffuse trained on FineHumanML3D achieves good results in quantitative evaluation. We also find that this model is better at generating spatially or chronologically composite motions, because it learns the implicit mappings from simple descriptions to the corresponding basic motions.