Human motion prediction is a classical problem in computer vision and computer graphics with a wide range of practical applications. Previous efforts achieve strong empirical performance with an encoding-decoding style: they first encode previous motions into latent representations and then decode those representations into predicted motions. In practice, however, these methods remain unsatisfactory because of several issues, including complicated loss constraints, cumbersome training processes, and a limited ability to switch between different categories of motion during prediction. In this paper, to address these issues, we depart from the foregoing style and propose a novel framework from a new perspective. Specifically, our framework works in a masked completion fashion. In the training stage, we learn a motion diffusion model that generates motions from random noise. In the inference stage, we use a denoising procedure to predict motion conditioned on the observed motions, which yields more continuous and controllable predictions. The proposed framework enjoys promising algorithmic properties: it needs only one loss in optimization and is trained in an end-to-end manner. Additionally, it switches between different categories of motion effectively, which is significant in realistic tasks such as animation. Comprehensive experiments on benchmarks confirm the superiority of the proposed framework. The project page is available at https://lhchen.top/Human-MAC.
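The masked completion idea described above can be illustrated with a generic inpainting-style denoising loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the network, the motion representation, or the sampler, so `toy_denoiser`, the linear noise schedule, the deterministic update step, and the one-value-per-frame motion format are all hypothetical choices made for illustration. At each step the observed frames are overwritten with a noised copy of the ground-truth observation, while the future frames come from the denoiser.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear noise schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def noise_to_level(x0, alpha_bar, rng):
    """Forward diffusion: sample a noisy version of clean frames x0."""
    scale, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a trained network that predicts clean motion."""
    return [0.9 * v for v in x_t]


def predict_with_mask(observed, horizon, num_steps=50, seed=0, denoiser=toy_denoiser):
    """Complete `horizon` future frames given `observed` frames (one float per frame)."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    n_obs = len(observed)

    # Start from pure noise over the whole sequence (observed + future).
    x = [rng.gauss(0.0, 1.0) for _ in range(n_obs + horizon)]

    for t in reversed(range(num_steps)):
        ab_t = alpha_bars[t]
        x0_hat = denoiser(x, t)  # predicted clean sequence

        if t == 0:
            # Final step: keep the observation exactly, take the predicted future.
            return list(observed) + x0_hat[n_obs:]

        # Deterministic update towards noise level t - 1 (an assumed sampler).
        ab_prev = alpha_bars[t - 1]
        eps_hat = [
            (xi - math.sqrt(ab_t) * x0i) / math.sqrt(1.0 - ab_t)
            for xi, x0i in zip(x, x0_hat)
        ]
        x_prev = [
            math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei
            for x0i, ei in zip(x0_hat, eps_hat)
        ]

        # Masked completion: observed frames are replaced by a noised copy of
        # the real observation; only the future frames come from the model.
        known = noise_to_level(observed, ab_prev, rng)
        x = known + x_prev[n_obs:]

    return x


prediction = predict_with_mask([0.0, 0.1, 0.2], horizon=4)
print(len(prediction), prediction[:3])
```

Because the observed frames are re-imposed at every step, no extra conditioning loss is needed in this sketch; the same unconditional denoiser serves both generation and prediction, which mirrors the single-loss property the abstract highlights.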