In this paper, we propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis, built on a strategy named GradUally Enriching SyntheSis (GUESS). The strategy sets up generation objectives by grouping semantically close body joints of a detailed skeleton and then replacing each such joint group with a single body-part node. Applied recursively, this operation abstracts a human pose into coarser and coarser skeletons at multiple granularity levels. As the abstraction level increases, human motion becomes more concise and stable, which significantly benefits the cross-modal motion synthesis task. The whole text-driven human motion synthesis problem is then divided across these abstraction levels and solved by a multi-stage generation framework built on a cascaded latent diffusion model: an initial generator first produces the coarsest human motion guess from a given text description; a series of successive generators then gradually enrich the motion details based on the textual description and the previously synthesized results. Notably, we further integrate GUESS with a proposed dynamic multi-condition fusion mechanism that dynamically balances the cooperative effects of the given textual condition and the synthesized coarse motion prompt at different generation stages. Extensive experiments on large-scale datasets verify that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realism, and diversity. Code is available at https://github.com/Xuehao-Gao/GUESS.
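The recursive skeleton abstraction described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy 9-joint skeleton, the group names and joint indices, and the choice to represent each body-part node by the mean position of its joints. The abstract does not specify how a node is computed, and the released GUESS code may differ.

```python
import numpy as np

# Hypothetical grouping of a toy 9-joint skeleton into 4 body-part nodes.
# Joint indices and group names are illustrative, not taken from the paper.
LEVEL_1_GROUPS = {
    "torso": [0, 1, 2],
    "left_arm": [3, 4],
    "right_arm": [5, 6],
    "legs": [7, 8],
}

# A coarser level that merges the 4 body-part nodes above into 2 nodes.
LEVEL_2_GROUPS = {
    "upper_body": [0, 1, 2],
    "lower_body": [3],
}


def abstract_motion(motion, groups):
    """Replace each joint group with a single node.

    Assumption: a node is the mean position of the joints in its group.

    motion: array of shape (frames, joints, 3)
    returns: array of shape (frames, len(groups), 3)
    """
    nodes = [motion[:, idx, :].mean(axis=1) for idx in groups.values()]
    return np.stack(nodes, axis=1)


def build_pyramid(motion, group_levels):
    """Apply the groupings in order and return motions from fine to coarse."""
    pyramid = [motion]
    for groups in group_levels:
        pyramid.append(abstract_motion(pyramid[-1], groups))
    return pyramid


# Random data standing in for a 60-frame motion clip.
rng = np.random.default_rng(0)
motion = rng.normal(size=(60, 9, 3))

pyramid = build_pyramid(motion, [LEVEL_1_GROUPS, LEVEL_2_GROUPS])
shapes = [level.shape for level in pyramid]
```

In a cascade of the kind the abstract describes, the last entry of `pyramid` would correspond to the target of the initial generator, and each earlier entry to the target of a later, more detailed stage.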