Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off among real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on a Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures the inherent dependencies among motion tokens and the semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, thereby achieving both high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens at the positions that need editing, MMM automatically fills the gaps while guaranteeing smooth transitions between edited and unedited parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429, respectively), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. Moreover, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at \url{https://exitudio.github.io/MMM-page}.
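The parallel, iterative decoding and mask-based editing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the decoding schedule, so this sketch assumes a cosine schedule with confidence-based re-masking, as is common in masked-token decoders; it is not the authors' implementation. The names `MASK`, `toy_predictor`, `iterative_decode`, the step count, and the codebook size of 512 are all hypothetical, and the predictor is a random stand-in for the conditional masked motion transformer.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token


def toy_predictor(tokens, codebook_size, rng):
    """Random stand-in for the conditional masked transformer (hypothetical).

    Returns a (token_id, confidence) guess for every position. A real model
    would condition on the text tokens and on the unmasked motion tokens.
    """
    return [(rng.randrange(codebook_size), rng.random()) for _ in tokens]


def iterative_decode(tokens, predictor, codebook_size, steps=4, seed=0):
    """Fill every MASK position in `tokens` over at most `steps` parallel passes."""
    rng = random.Random(seed)
    tokens = list(tokens)
    total_masked = sum(t == MASK for t in tokens)
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = predictor(tokens, codebook_size, rng)
        # Assumed cosine schedule: number of positions left masked after this pass.
        keep_masked = math.floor(total_masked * math.cos(math.pi / 2 * step / steps))
        keep_masked = min(keep_masked, len(masked) - 1)  # commit at least one token
        # Commit the most confident guesses in parallel; the rest stay masked.
        by_confidence = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in by_confidence[: len(masked) - keep_masked]:
            tokens[i] = guesses[i][0]
    return tokens


# Generation: start from a fully masked sequence.
generated = iterative_decode([MASK] * 8, toy_predictor, codebook_size=512)

# Editing (in-betweening): mask only the span to be regenerated.
existing = [3, 14, 15, 92, 65, 35, 89, 79]
edited = iterative_decode(
    existing[:3] + [MASK] * 2 + existing[5:], toy_predictor, codebook_size=512
)
```

In this sketch, generation and editing share one routine: only masked positions are ever overwritten, so the unmasked tokens of `existing` pass through unchanged. The smooth transitions the abstract mentions would come from a trained model attending to those unmasked tokens, which the random stand-in does not do.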