Human motion prediction is a classical problem in computer vision and computer graphics with a wide range of practical applications. Previous efforts achieve strong empirical performance with an encoding-decoding paradigm: they first encode the observed motions into latent representations and then decode those representations into predicted motions. In practice, however, these methods remain unsatisfactory for several reasons, including complicated loss constraints, cumbersome training processes, and a limited ability to switch between different categories of motion during prediction. In this paper, to address these issues, we depart from the encoding-decoding paradigm and propose a novel framework from a new perspective. Specifically, our framework works in a denoising diffusion style. In the training stage, we learn a motion diffusion model that generates motions from random noise. In the inference stage, we use a denoising procedure conditioned on the observed motions to produce more continuous and controllable predictions. The proposed framework has appealing algorithmic properties: it needs only one loss for optimization and is trained in an end-to-end manner. Additionally, it switches between different categories of motion effectively, which is significant in realistic tasks, \textit{e.g.}, the animation task. Comprehensive experiments on benchmarks confirm the superiority of the proposed framework. The project page is available at \url{https://lhchen.top/Human-MAC}.
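The two stages described above can be illustrated with a minimal NumPy sketch. It is not the implementation from the paper: the linear noise schedule, the noise-prediction (MSE) loss, the inpainting-style step that overwrites observed frames during sampling, and all names (`make_schedule`, `q_sample`, `noise_prediction_loss`, `predict`, `zero_denoiser`) are assumptions chosen for illustration. A trained network would take the place of the placeholder denoiser.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice); returns betas and cumulative alphas."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    """Forward process: noise a clean motion x0 to diffusion step t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def noise_prediction_loss(denoiser, x0, t, alpha_bars, rng):
    """Single training loss: MSE between the true and the predicted noise."""
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bars, noise)
    return float(np.mean((denoiser(x_t, t) - noise) ** 2))

def predict(denoiser, observed, mask, betas, alpha_bars, rng):
    """Reverse process with the observed frames re-imposed at every step.

    observed: (frames, dims) array; only rows where mask == 1 are trusted.
    mask:     (frames, 1) array, 1 for observed frames, 0 for frames to predict.
    """
    x = rng.standard_normal(observed.shape)
    for t in reversed(range(len(betas))):
        # Standard DDPM posterior mean computed from the predicted noise.
        eps = denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(1.0 - betas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
            # Noise the observation to level t-1 so it matches the sample's noise level.
            known = q_sample(observed, t - 1, alpha_bars, rng.standard_normal(x.shape))
        else:
            x = mean
            known = observed
        # Assumed conditioning scheme: keep observed frames, let the model fill the rest.
        x = mask * known + (1.0 - mask) * x
    return x

def zero_denoiser(x_t, t):
    """Hypothetical placeholder for a trained network; always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
motion = rng.standard_normal((10, 6))   # 10 frames, 6 pose dimensions (toy sizes)
mask = np.zeros((10, 1))
mask[:4] = 1.0                          # the first 4 frames are observed
loss = noise_prediction_loss(zero_denoiser, motion, 10, alpha_bars, rng)
prediction = predict(zero_denoiser, motion, mask, betas, alpha_bars, rng)
```

In this sketch the model itself is trained unconditionally, and the observation enters only at sampling time through the mask. Under the same assumption, changing which frames the mask marks as known is what would allow different observed segments to constrain one generated sequence.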