Language-guided human motion synthesis remains a challenging task because of the inherent complexity and diversity of human behaviors. Previous methods generalize poorly to novel actions, often producing unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem by decomposing actions into atomic actions and employing a curriculum learning strategy to learn how atomic actions are composed. First, we disentangle complex human motions into a set of atomic actions during learning, and then assemble novel actions from the learned atomic actions, which offers better adaptability to new actions. Moreover, we introduce a curriculum learning strategy that leverages masked motion modeling with a gradually increasing mask ratio, and thus facilitates atomic action assembly. This approach mitigates the overfitting problem commonly encountered in previous methods while encouraging the model to learn better motion representations. We demonstrate the effectiveness of ATOM through extensive experiments on text-to-motion and action-to-motion synthesis tasks. We further illustrate its superiority in synthesizing plausible and coherent text-guided human motion sequences.
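To make the curriculum concrete, the following minimal Python sketch shows one way a mask ratio could grow over training and be applied to the frames of a motion sequence. It is an illustration under stated assumptions, not ATOM's actual implementation: the abstract specifies neither the schedule shape nor the masking granularity, so the linear schedule, the 0.1 to 0.9 range, the frame-level masking, and the names `mask_ratio` and `mask_frames` are all hypothetical.

```python
import random


def mask_ratio(step, total_steps, start=0.1, end=0.9):
    """Linearly interpolate the mask ratio from `start` to `end`.

    The linear shape and the default bounds are assumptions made for
    illustration; the abstract only says the ratio increases gradually.
    """
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay within the bounds.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def mask_frames(num_frames, ratio, rng):
    """Return a list of booleans; True marks a frame hidden from the model.

    Masking whole frames uniformly at random is an assumed granularity.
    """
    num_masked = round(num_frames * ratio)
    masked = set(rng.sample(range(num_frames), num_masked))
    return [i in masked for i in range(num_frames)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    for step in (0, 50, 99):
        ratio = mask_ratio(step, total_steps=100)
        mask = mask_frames(60, ratio, rng)
        print(f"step={step} ratio={ratio:.2f} masked_frames={sum(mask)}")
```

Early in training few frames are hidden, so reconstruction is easy; as the ratio rises, the model must fill in progressively longer stretches of motion from the remaining context, which is the easy-to-hard ordering a curriculum relies on.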