Human motion prediction is a classical problem in computer vision and computer graphics with a wide range of practical applications. Previous efforts achieve strong empirical performance with an encoding-decoding style: they first encode previous motions into latent representations and then decode those representations into predicted motions. In practice, however, these methods remain unsatisfactory because of several issues, including complicated loss constraints, cumbersome training processes, and a limited ability to switch between different categories of motion during prediction. In this paper, to address these issues, we depart from the encoding-decoding style and propose a novel framework from a new perspective. Specifically, our framework works in a masked completion fashion. In the training stage, we learn a motion diffusion model that generates motions from random noise. In the inference stage, a denoising procedure predicts motion conditioned on the observed motions, yielding more continuous and controllable predictions. The proposed framework has promising algorithmic properties: it needs only one loss in optimization and is trained in an end-to-end manner. Additionally, it switches between different categories of motion effectively, which is significant in realistic tasks such as animation. Comprehensive experiments on benchmarks confirm the superiority of the proposed framework. The project page is available at https://lhchen.top/Human-MAC.
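The masked completion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain DDPM ancestral sampling with a linear noise schedule, applied directly to pose vectors, and a placeholder `toy_denoiser` standing in for a trained noise-prediction network. All function names and parameters here are hypothetical. At every denoising step the observed frames are overwritten with a suitably noised copy of the observation, so only the future frames are actually generated.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the abstract names no schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_bar = np.cumprod(1.0 - betas)
    return betas, alphas_bar

def q_sample(x0, t, alphas_bar, rng):
    """Forward-diffuse a clean motion x0 to noise level t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a trained network that predicts the noise."""
    return np.zeros_like(x_t)

def predict_by_masked_completion(observed, horizon, denoiser, num_steps=50, seed=0):
    """Fill in `horizon` future frames after `observed` (shape: frames x dims)."""
    rng = np.random.default_rng(seed)
    betas, alphas_bar = make_schedule(num_steps)
    n_obs, dim = observed.shape
    total = n_obs + horizon

    # Pad the observation to full length; mask is 1 on observed frames.
    padded = np.zeros((total, dim))
    padded[:n_obs] = observed
    mask = np.zeros((total, 1))
    mask[:n_obs] = 1.0

    x = rng.standard_normal((total, dim))  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t)
        # DDPM posterior mean computed from the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(1.0 - betas[t])
        if t > 0:
            x_gen = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
            x_obs = q_sample(padded, t - 1, alphas_bar, rng)  # observation at level t-1
        else:
            x_gen = mean
            x_obs = padded  # final step: keep the clean observation
        # Masked completion: observed frames come from data, the rest from the model.
        x = mask * x_obs + (1.0 - mask) * x_gen
    return x
```

With a real trained denoiser in place of `toy_denoiser`, calling `predict_by_masked_completion(observed, horizon=100, denoiser=model)` would return the observed frames unchanged, followed by the generated future frames. Because conditioning happens only through the mask at inference time, the training objective in this sketch would remain the single standard noise-prediction loss.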