Controllable human motion synthesis is essential for applications in AR/VR, gaming, film, and embodied AI. Existing methods often focus solely on either language control or full trajectory control, and they lack precision when synthesizing motions aligned with user-specified trajectories, especially for multi-joint control. To address these issues, we present TLControl, a new method for realistic human motion synthesis that combines low-level trajectory control with high-level language semantic control. Specifically, we first train a VQ-VAE to learn a compact latent motion space organized by body parts. We then propose a Masked Trajectories Transformer that makes coarse initial predictions of the full joint trajectories based on the learned latent motion space, conditioned on user-specified partial trajectories and text descriptions. Finally, we introduce an efficient test-time optimization that refines these coarse predictions for accurate trajectory control. Experiments demonstrate that TLControl outperforms state-of-the-art methods in trajectory accuracy and time efficiency, making it practical for interactive, high-quality animation generation.
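The test-time refinement step can be illustrated with a deliberately simplified sketch. The abstract does not state the objective or the variables being optimized, so everything below is an assumption made for illustration: the function name `refine_trajectory`, the masked squared-error loss, the `prior_weight` term that keeps the result near the coarse prediction, and the choice to optimize trajectory values directly rather than a learned latent code. It is not the authors' implementation.

```python
def refine_trajectory(coarse, target, mask, steps=200, lr=0.1, prior_weight=0.01):
    """Refine a coarse 1-D trajectory by plain gradient descent.

    Hypothetical objective (an assumption, not taken from the paper):
        sum_i mask[i] * (x[i] - target[i])**2
        + prior_weight * sum_i (x[i] - coarse[i])**2
    """
    x = list(coarse)  # start from the coarse prediction
    for _ in range(steps):
        for i in range(len(x)):
            # Pull toward the coarse prediction (acts as a weak prior).
            grad = 2.0 * prior_weight * (x[i] - coarse[i])
            # Pull toward the user-specified value only where one is given.
            if mask[i]:
                grad += 2.0 * (x[i] - target[i])
            x[i] -= lr * grad
    return x


# Toy example: frames 0 and 2 are user-controlled, frames 1 and 3 are free.
coarse = [0.0, 0.5, 1.0, 1.5]
target = [0.2, 0.0, 1.3, 0.0]   # values at unmasked frames are ignored
mask = [True, False, True, False]
refined = refine_trajectory(coarse, target, mask)
```

In this toy setting the controlled frames move close to their targets, while the uncontrolled frames stay at the coarse prediction because no loss term pulls them away. A realistic system would presumably also need a term that keeps the refined motion plausible, which this sketch omits.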