The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motion or filling in intermediate poses conditioned on known key poses. In this paper, we present a novel task-independent model called UNIMASK-M, which addresses these challenges with a single unified architecture and obtains performance comparable to or better than the state of the art on each task. Inspired by Vision Transformers (ViTs), UNIMASK-M decomposes a human pose into body parts to leverage the spatio-temporal relationships present in human motion. We further reformulate the various pose-conditioned motion synthesis tasks as a single reconstruction problem in which each task corresponds to a different masking pattern given as input. Because the model is explicitly informed which joints are masked, UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset. It also achieves state-of-the-art results in motion inbetweening on the LaFAN1 dataset, particularly for long transition periods. More information can be found on the project website: https://sites.google.com/view/estevevallsmascaro/publications/unimask-m.
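The abstract's idea of casting different synthesis tasks as one reconstruction problem with different masking patterns can be illustrated with a minimal sketch. This is not the UNIMASK-M implementation: the function names, the `(frames, joints, 3)` array layout, and the choice to fill masked entries with zeros are all assumptions made for illustration only.

```python
import numpy as np


def forecasting_mask(num_frames, num_joints, num_observed):
    """Hypothetical helper: mask every frame after the observed prefix.

    True marks an entry the model would have to reconstruct.
    """
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    mask[num_observed:] = True
    return mask


def inbetweening_mask(num_frames, num_joints, key_frames):
    """Hypothetical helper: mask everything except the given key frames."""
    mask = np.ones((num_frames, num_joints), dtype=bool)
    mask[list(key_frames)] = False
    return mask


def apply_mask(motion, mask):
    """Zero out masked joints and return the mask next to the input.

    Returning the mask reflects the idea of explicitly informing the
    model which joints are missing. Zero filling is an assumption.
    """
    masked_motion = np.where(mask[..., None], 0.0, motion)
    return masked_motion, mask


# Toy motion clip: 5 frames, 2 joints, 3D coordinates (assumed layout).
motion = np.ones((5, 2, 3))

# Forecasting: the first 3 frames are observed, the last 2 are masked.
forecast_input, forecast_mask = apply_mask(motion, forecasting_mask(5, 2, 3))

# Inbetweening: only frames 0 and 4 are known key poses.
inbetween_input, inbetween_mask = apply_mask(
    motion, inbetweening_mask(5, 2, [0, 4])
)
```

Under these assumptions both tasks share the same input format, and only the mask changes. An occlusion scenario could be expressed the same way by masking individual joints rather than whole frames.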