Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework that extends masked modeling to bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate the model's ability to perform text-free motion completion, text-guided motion prediction, and motion caption correction without architectural changes. Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.
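For intuition, below is a minimal MaskGIT-style sketch of the iterative masked refinement loop described above, in which the step count directly trades latency for quality. Every identifier here (predict_logits, MASK_ID, the vocabulary size, sequence length, and cosine re-masking schedule) is an illustrative assumption, not DiMo's actual implementation; in this simplified variant all positions are re-predicted each pass rather than frozen once committed.

```python
# Hypothetical sketch of iterative masked token refinement (discrete
# diffusion-style decoding); constants and the model stub are assumptions.
import math
import torch

MASK_ID = 0        # assumed id of the [MASK] token
VOCAB_SIZE = 512   # assumed motion-codebook size
SEQ_LEN = 49       # assumed number of motion tokens per sequence


def predict_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the bidirectional transformer; returns random logits."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB_SIZE)


def iterative_refine(num_steps: int, batch: int = 1) -> torch.Tensor:
    """Decode a motion-token sequence in `num_steps` refinement passes.

    Fewer steps -> lower latency; more steps -> higher quality. This is
    the quality-latency trade-off the abstract describes.
    """
    tokens = torch.full((batch, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(num_steps):
        probs = predict_logits(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)  # per-position confidence
        # Cosine schedule: fraction of tokens left masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_masked = int(mask_ratio * SEQ_LEN)
        tokens = pred.clone()
        if num_masked > 0:
            # Re-mask the least confident positions; refine them next pass.
            idx = conf.topk(num_masked, dim=-1, largest=False).indices
            tokens.scatter_(1, idx, MASK_ID)
    return tokens


print(iterative_refine(num_steps=8))
```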