Although part-based motion synthesis networks have been investigated as a way to reduce the complexity of modeling heterogeneous human motions, their computational cost remains prohibitive for interactive applications. To this end, we propose a novel two-part transformer network that aims to achieve high-quality, controllable motion synthesis in real time. Our network separates the skeleton into upper and lower body parts, which reduces expensive cross-part fusion operations, and models the motion of each part separately through two streams of auto-regressive modules built from multi-head attention layers. However, such a design might not sufficiently capture the correlations between the parts. We therefore let the two parts share the features of the root joint, and we design a consistency loss that penalizes the difference between the root features and motions estimated by the two auto-regressive modules, which significantly improves the quality of the synthesized motions. After training on our motion dataset, our network can synthesize a wide range of heterogeneous motions, such as cartwheels and twists. Experimental and user study results demonstrate that our network outperforms state-of-the-art human motion synthesis networks in the quality of the generated motions.
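
As an illustration of the consistency idea described above, the following is a minimal sketch, not the formulation used in the paper. It assumes that each stream outputs its own estimate of the root-joint features as a flat vector and that the penalty is a mean squared difference; the function name `root_consistency_loss` and both argument names are hypothetical.

```python
from typing import Sequence


def root_consistency_loss(
    upper_root: Sequence[float], lower_root: Sequence[float]
) -> float:
    """Mean squared difference between two estimates of the root features.

    Assumption: upper_root and lower_root are the root-joint estimates
    produced by the upper-body and lower-body streams, flattened to
    vectors of equal length. The squared-error form is an illustrative
    choice; the abstract does not specify the actual loss.
    """
    if len(upper_root) != len(lower_root):
        raise ValueError("root estimates must have the same length")
    if len(upper_root) == 0:
        raise ValueError("root estimates must be non-empty")
    squared_diffs = [(u - l) ** 2 for u, l in zip(upper_root, lower_root)]
    return sum(squared_diffs) / len(squared_diffs)


# Identical estimates incur no penalty; diverging estimates are penalized.
agree = root_consistency_loss([0.5, 1.0, -2.0], [0.5, 1.0, -2.0])
disagree = root_consistency_loss([0.0, 0.0], [1.0, 3.0])
```

In a training setup, a term of this kind would typically be added to each stream's reconstruction loss with a weighting factor, so that the two streams are pushed toward agreeing on the shared root joint.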