Human motion prediction is a classical problem in computer vision and computer graphics with a wide range of practical applications. Previous efforts achieve strong empirical performance with an encoding-decoding style: they first encode previous motions into latent representations and then decode those representations into predicted motions. In practice, however, these methods remain unsatisfactory because of several issues, including complicated loss constraints, cumbersome training processes, and a limited ability to switch between different categories of motion during prediction. In this paper, to address these issues, we depart from the foregoing style and propose a novel framework from a new perspective. Specifically, our framework works in a masked completion fashion. In the training stage, we learn a motion diffusion model that generates motions from random noise. In the inference stage, a denoising procedure conditions the prediction on the observed motions, yielding more continuous and controllable predictions. The proposed framework has promising algorithmic properties: it needs only one loss for optimization and is trained in an end-to-end manner. Additionally, it switches between different categories of motion effectively, which is significant in realistic tasks such as animation. Comprehensive experiments on benchmarks confirm the superiority of the proposed framework. The project page is available at https://lhchen.top/Human-MAC.
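The masked completion idea can be illustrated with a small sketch. The abstract does not spell out the sampling procedure, so what follows is only one plausible reading: an inpainting-style DDPM reverse loop in which the observed frames are re-noised to the current noise level and written back at every step, while the future frames come from the denoiser. Everything named here is an assumption made for illustration and not the authors' implementation: the functions `make_schedule`, `toy_denoiser`, and `masked_completion`, the linear noise schedule, the padding of future frames with the last observed pose, and the choice of ancestral DDPM sampling. The denoiser is a placeholder that predicts zero noise; a trained motion diffusion model would take its place.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the abstract names no schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a trained noise-prediction network.

    It always predicts zero noise, so the output is not a meaningful motion.
    """
    return np.zeros_like(x_t)


def masked_completion(observed, horizon, denoiser, num_steps=50, seed=0):
    """Fill in `horizon` future frames after `observed` (frames x features)."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    n_obs, dim = observed.shape
    total = n_obs + horizon

    # Pad the future with the last observed pose (illustrative assumption).
    padded = np.concatenate(
        [observed, np.repeat(observed[-1:], horizon, axis=0)], axis=0
    )

    # Mask is 1 on observed frames and 0 on frames to be predicted.
    mask = np.zeros((total, 1))
    mask[:n_obs] = 1.0

    # Start the whole sequence from pure Gaussian noise.
    x = rng.standard_normal((total, dim))

    for t in range(num_steps - 1, -1, -1):
        # Standard DDPM posterior mean computed from the predicted noise.
        eps_hat = denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(
            alphas[t]
        )

        if t > 0:
            # Unknown part: one ancestral sampling step.
            x_unknown = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
            # Known part: observed motion noised to the level of step t - 1.
            x_known = np.sqrt(alpha_bars[t - 1]) * padded + np.sqrt(
                1.0 - alpha_bars[t - 1]
            ) * rng.standard_normal(x.shape)
        else:
            # Final step: no noise is added and the clean observation is used.
            x_unknown = mean
            x_known = padded

        # Masked completion: keep observed frames, fill in the rest.
        x = mask * x_known + (1.0 - mask) * x_unknown

    return x


if __name__ == "__main__":
    history = np.arange(12, dtype=float).reshape(4, 3)  # 4 frames, 3 features
    motion = masked_completion(history, horizon=6, denoiser=toy_denoiser)
    print(motion.shape)
```

Because the mask writes the clean observation back at the last step, the first frames of the returned sequence match the observed motion exactly, and only the masked-out future frames are generated. The same loop would in principle allow other mask layouts, for example fixing frames at both ends of the sequence, which is one way to think about controllable completion.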