Although part-based motion synthesis networks have been investigated as a way to reduce the complexity of modeling heterogeneous human motions, their computational cost remains prohibitive for interactive applications. To address this, we propose a novel two-part transformer network that aims to achieve high-quality, controllable motion synthesis in real time. Our network splits the skeleton into upper-body and lower-body parts, which reduces the expensive cross-part fusion operations, and models the motion of each part separately through two streams of auto-regressive modules built from multi-head attention layers. However, such a design might not sufficiently capture the correlations between the two parts. We therefore let the two parts share the features of the root joint and design a consistency loss that penalizes differences between the root features and motions estimated by the two auto-regressive modules, which significantly improves the quality of the synthesized motions. After training on our motion dataset, our network can synthesize a wide range of heterogeneous motions, such as cartwheels and twists. Experimental and user study results demonstrate that our network outperforms state-of-the-art human motion synthesis networks in the quality of the generated motions.
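To make the idea of the consistency loss concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the form of the loss, so a mean squared difference is assumed here; the function name `root_consistency_loss` and the frame-by-feature list layout are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def root_consistency_loss(
    upper_root: Sequence[Sequence[float]],
    lower_root: Sequence[Sequence[float]],
) -> float:
    """Mean squared difference between two streams' root-joint predictions.

    Assumption: each argument is a list of frames, and each frame is a list
    of root-joint values (features or motion) predicted by one stream.
    """
    if len(upper_root) != len(lower_root):
        raise ValueError("both streams must predict the same number of frames")

    total = 0.0
    count = 0
    for upper_frame, lower_frame in zip(upper_root, lower_root):
        if len(upper_frame) != len(lower_frame):
            raise ValueError("root vectors must have the same length per frame")
        for u, v in zip(upper_frame, lower_frame):
            total += (u - v) ** 2  # penalize disagreement on each root value
            count += 1

    # An empty prediction carries no disagreement.
    return total / count if count else 0.0
```

In such a setup the term would be added to each stream's reconstruction loss, so that the upper-body and lower-body modules are pushed toward agreeing on the shared root joint.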