Recently, significant progress has been made in text-based motion generation, enabling the synthesis of diverse, high-quality human motions that conform to textual descriptions. However, generating motions beyond the distribution of the original training datasets, i.e., zero-shot generation, remains challenging. Adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first leverage a large language model to parse previously vague textual annotations into fine-grained descriptions of different body parts. We then use these fine-grained descriptions to guide a transformer-based diffusion model, which further adopts a design of part tokens. Because such descriptions are closer to the essence of motion, FG-MDM can generate human motions beyond the scope of the original datasets. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT.