Recent work has demonstrated the significant potential of denoising diffusion models for generating human motion, including text-to-motion capabilities. However, these methods are restricted by the paucity of annotated motion data, a focus on single-person motions, and a lack of detailed control. In this paper, we introduce three forms of composition based on diffusion priors: sequential, parallel, and model composition. Using sequential composition, we tackle the challenge of long sequence generation. We introduce DoubleTake, an inference-time method with which we generate long animations consisting of sequences of prompted intervals and their transitions, using a prior trained only for short clips. Using parallel composition, we show promising steps toward two-person generation. Beginning with two fixed priors as well as a few two-person training examples, we learn a slim communication block, ComMDM, to coordinate interaction between the two resulting motions. Lastly, using model composition, we first train individual priors to complete motions that realize a prescribed motion for a given joint. We then introduce DiffusionBlending, an interpolation mechanism to effectively blend several such models to enable flexible and efficient fine-grained joint and trajectory-level control and editing. We evaluate the composition methods using an off-the-shelf motion diffusion model, and further compare the results to dedicated models trained for these specific tasks.
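The blending idea behind DiffusionBlending can be illustrated with a minimal sketch. The abstract only states that it is an interpolation mechanism over several fine-tuned priors; the linear form below, the function names `blend_predictions` and `blended_step`, the `scale` parameter, and the two toy priors are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in: a "prior" maps a noisy motion vector to a denoised estimate.
Prior = Callable[[Sequence[float]], List[float]]


def blend_predictions(pred_a: Sequence[float],
                      pred_b: Sequence[float],
                      scale: float) -> List[float]:
    """Linearly interpolate two predictions.

    scale = 0 returns pred_a, scale = 1 returns pred_b; values outside
    [0, 1] extrapolate along the same line.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same length")
    return [a + scale * (b - a) for a, b in zip(pred_a, pred_b)]


def blended_step(prior_a: Prior,
                 prior_b: Prior,
                 noisy_motion: Sequence[float],
                 scale: float) -> List[float]:
    """Query both priors on the same noisy input and blend their outputs."""
    return blend_predictions(prior_a(noisy_motion), prior_b(noisy_motion), scale)


# Toy priors standing in for two models fine-tuned to control different joints.
def root_prior(x: Sequence[float]) -> List[float]:
    return [0.5 * v for v in x]


def wrist_prior(x: Sequence[float]) -> List[float]:
    return [v + 1.0 for v in x]


if __name__ == "__main__":
    print(blended_step(root_prior, wrist_prior, [2.0, 4.0], scale=0.25))
```

In a real sampler, a step like this would run once per denoising iteration, with `scale` setting how strongly the second control signal is weighted against the first.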