Prior work on motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods, whose fidelity degrades under stronger kinematic constraints, our method improves fidelity under them, reducing FID from 0.033 to 0.014.
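The Perception, Planning, and Control data flow described above can be illustrated with a toy sketch. This is not the MoTok implementation: the names `perceive`, `plan`, `control`, and `CODEBOOK` are hypothetical, the codebook is random, nearest-code lookup stands in for the token generator, and plain gradient descent stands in for the diffusion-based optimization. It only shows how coarse constraints could feed token planning while dense constraints are enforced afterwards.

```python
import numpy as np

# Hypothetical codebook: 16 codes, each a 2-D root position (assumption).
rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(16, 2))

def perceive(trajectory, stride=4):
    """Perception: split a dense trajectory into coarse and fine constraints."""
    coarse = trajectory[::stride]  # sparse waypoints, used during planning
    fine = trajectory              # dense targets, used during control
    return coarse, fine

def plan(coarse):
    """Planning: choose one discrete token per coarse waypoint (nearest code)."""
    dists = np.linalg.norm(coarse[:, None, :] - CODEBOOK[None, :, :], axis=-1)
    return dists.argmin(axis=1)    # integer token ids, one per waypoint

def control(tokens, fine, stride=4, steps=50, lr=0.2):
    """Control: decode tokens, then refine toward the fine constraints.

    Gradient descent on squared error is a stand-in for diffusion-based
    optimization; it is not a diffusion sampler.
    """
    # Decode: hold each code for `stride` frames, then trim to the target length.
    motion = np.repeat(CODEBOOK[tokens], stride, axis=0)[: len(fine)]
    for _ in range(steps):
        motion = motion - lr * 2.0 * (motion - fine)  # grad of ||motion - fine||^2
    return motion

# Usage: a 32-frame toy trajectory yields 8 tokens with stride 4.
t = np.linspace(0.0, 1.0, 32)
trajectory = np.stack([t, np.sin(2 * np.pi * t)], axis=1)
coarse, fine = perceive(trajectory)
tokens = plan(coarse)
motion = control(tokens, fine)
print(tokens.shape, float(np.abs(motion - fine).max()))
```

In this sketch the token sequence never sees the dense trajectory, which mirrors the stated design goal of keeping kinematic detail out of semantic token planning.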