Most text-driven human motion generation methods employ sequential modeling approaches, such as transformers, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize action names at the expense of other important properties, and they lack the fine-grained details needed to guide the synthesis of subtly distinct motions. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs with three levels: motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion descriptions and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, HumanML3D and KIT, show superior performance and support the efficacy of our method. Moreover, by modifying the edge weights of the hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-trained weights are available at https://github.com/jpthu17/GraphMotion.
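To make the three-level structure concrete, the following is a minimal sketch of a hierarchical semantic graph with weighted edges. It is an illustration only, not the released implementation: the `SemanticGraph` class, its method names, the hand-built example sentence, and the rule that each stage sees its own level plus all coarser levels are assumptions introduced here, and the text parsing and diffusion model are omitted.

```python
from dataclasses import dataclass, field

LEVELS = ("motion", "action", "specific")  # coarse to fine, as named in the abstract


@dataclass
class SemanticGraph:
    """Toy three-level graph (hypothetical structure, for illustration only)."""
    nodes: dict = field(default_factory=dict)  # node name -> level
    edges: dict = field(default_factory=dict)  # (parent, child) -> weight

    def add_node(self, name, level):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.nodes[name] = level

    def add_edge(self, parent, child, weight=1.0):
        # Edges only run from one level to the next finer level.
        gap = LEVELS.index(self.nodes[child]) - LEVELS.index(self.nodes[parent])
        if gap != 1:
            raise ValueError("edges must connect adjacent levels, coarse to fine")
        self.edges[(parent, child)] = weight

    def scale_edge(self, parent, child, factor):
        # Re-weight a single edge, mimicking the edge-weight edits used for refinement.
        self.edges[(parent, child)] *= factor

    def conditions_up_to(self, level):
        # Assumption: a generation stage sees its own level and all coarser levels.
        depth = LEVELS.index(level)
        return [n for n, lv in self.nodes.items() if LEVELS.index(lv) <= depth]


def build_example():
    # Hand-built graph for "a person walks forward slowly, then sits down".
    g = SemanticGraph()
    g.add_node("sentence", "motion")
    for action in ("walks", "sits"):
        g.add_node(action, "action")
        g.add_edge("sentence", action)
    for action, specific in (("walks", "forward"), ("walks", "slowly"), ("sits", "down")):
        g.add_node(specific, "specific")
        g.add_edge(action, specific)
    return g


g = build_example()
g.scale_edge("walks", "slowly", 2.0)  # emphasize "slowly" relative to the other specifics
for level in LEVELS:
    print(level, g.conditions_up_to(level))
```

In this sketch, each successive level adds finer nodes to the conditioning set, and scaling an edge weight changes how strongly one specific is attached to its action without rebuilding the graph.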