Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest. See https://www.speech.kth.se/research/listen-denoise-action/ for video examples, data, and code.
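As an illustration of the guidance idea mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes the denoiser's outputs are available as plain lists of floats. `guided_noise` applies the standard classifier-free guidance combination of an unconditional and a conditional noise prediction. `blended_noise` is a hypothetical generalisation that sums weighted guidance directions from several style-conditioned predictions, which is one way to picture style interpolation; the exact product-of-experts formulation used in the paper may differ. All function and parameter names are assumptions introduced for this example.

```python
from typing import List, Sequence


def guided_noise(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    gamma: float,
) -> List[float]:
    """Standard classifier-free guidance.

    gamma = 0 returns the unconditional prediction, gamma = 1 the
    conditional one, and gamma > 1 exaggerates the conditioning
    (here: a more pronounced style).
    """
    return [u + gamma * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def blended_noise(
    eps_uncond: Sequence[float],
    eps_styles: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Hypothetical multi-style extension (an assumption, not the paper's code).

    Adds one weighted guidance direction per style-conditioned prediction
    to the unconditional prediction.
    """
    if len(eps_styles) != len(weights):
        raise ValueError("need exactly one weight per style prediction")
    out = list(eps_uncond)
    for eps_s, w in zip(eps_styles, weights):
        out = [o + w * (s - u) for o, s, u in zip(out, eps_s, eps_uncond)]
    return out


# Usage: toy two-dimensional "noise predictions".
uncond = [0.0, 1.0]
style_a = [1.0, 3.0]
style_b = [3.0, 1.0]

stronger_a = guided_noise(uncond, style_a, gamma=2.0)
halfway = blended_noise(uncond, [style_a, style_b], weights=[0.5, 0.5])
```

With a single style and weight `gamma`, `blended_noise` reduces to `guided_noise`, so the sketch treats ordinary guidance as a special case of the blended form.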