Text-to-motion generation is a challenging task: the generated human motion must align with the input text while also respecting human capabilities and physical laws. Although diffusion models have advanced, their application in discrete spaces remains underexplored. Current methods often treat all motions uniformly, overlooking the fact that not all motions are equally relevant to a given textual description; the more salient and informative motions should be given precedence during generation. In response, we introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM). It uses a Transformer-based VQ-VAE, with a global self-attention mechanism and a regularization term that counteracts code collapse, to derive a concise, discrete motion representation. We also present a motion discrete diffusion model with a novel noise schedule determined by the significance of each motion token within the entire motion sequence. This schedule retains the most salient motions during the reverse diffusion process, leading to more semantically rich and varied motions. Additionally, we formulate two strategies for gauging the importance of motion tokens, drawing on both textual and visual indicators. Comprehensive experiments on the HumanML3D and KIT-ML datasets confirm that our model surpasses existing techniques in fidelity and diversity, particularly for intricate textual descriptions.
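To make the idea of an importance-dependent noise schedule concrete, the following is a minimal sketch and not the schedule used in M2DM. It assumes an absorbing-state (mask) discrete diffusion, per-token priority scores already normalized to [0, 1], and a hypothetical power-law form for the per-token masking probability; the names `MASK`, `mask_probability`, `corrupt`, and the `strength` parameter are all illustrative. Under these assumptions, high-priority tokens are corrupted later in the forward process, so they are the ones available earliest when the process is reversed.

```python
import random

MASK = -1  # hypothetical id for the absorbing [MASK] state


def mask_probability(t, num_steps, priority, strength=2.0):
    """Probability that a token has been masked by forward step t.

    Assumed form: (t / num_steps) ** (1 + strength * priority).
    It is 0 at t = 0, 1 at t = num_steps, increases with t, and is
    smaller for higher-priority tokens at every intermediate step.
    """
    if not 0 <= t <= num_steps:
        raise ValueError("t must lie in [0, num_steps]")
    if not 0.0 <= priority <= 1.0:
        raise ValueError("priority must lie in [0, 1]")
    base = t / num_steps
    return base ** (1.0 + strength * priority)


def corrupt(tokens, priorities, t, num_steps, rng):
    """Sample a corrupted sequence at step t by masking each token
    independently with its own priority-dependent probability."""
    return [
        MASK if rng.random() < mask_probability(t, num_steps, p) else tok
        for tok, p in zip(tokens, priorities)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [12, 7, 93, 7]            # illustrative codebook indices
    priorities = [0.1, 0.9, 0.5, 0.0]  # illustrative importance scores
    for t in (0, 5, 10):
        print(t, corrupt(tokens, priorities, t, 10, rng))
```

At an intermediate step, a token with priority 0.9 is masked with lower probability than one with priority 0.1, which is the qualitative behavior described above; how the priorities themselves are computed from textual or visual cues is outside the scope of this sketch.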