Controllable human motion synthesis is essential for applications in AR/VR, gaming, movies, and embodied AI. Existing methods often focus solely on either language or full trajectory control, lacking precision in synthesizing motions aligned with user-specified trajectories, especially for multi-joint control. To address these issues, we present TLControl, a new method for realistic human motion synthesis, incorporating both low-level trajectory and high-level language semantics controls. Specifically, we first train a VQ-VAE to learn a compact latent motion space organized by body parts. We then propose a Masked Trajectories Transformer to make coarse initial predictions of full trajectories of joints based on the learned latent motion space, with user-specified partial trajectories and text descriptions as conditioning. Finally, we introduce an efficient test-time optimization to refine these coarse predictions for accurate trajectory control. Experiments demonstrate that TLControl outperforms the state-of-the-art in trajectory accuracy and time efficiency, making it practical for interactive and high-quality animation generation.
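The test-time refinement step can be illustrated with a minimal sketch. It is not the TLControl implementation: the abstract does not specify the objective or the optimizer, and the real method works on top of the learned latent motion space. The sketch assumes a simplified objective optimized directly over joint positions, namely a masked squared error toward the user-specified trajectories plus a proximity term that keeps the result close to the coarse prediction. The function name `refine_trajectories` and the parameters `lam`, `lr`, and `steps` are hypothetical.

```python
def refine_trajectories(coarse, target, mask, lam=0.1, lr=0.1, steps=200):
    """Refine a coarse joint-trajectory prediction by plain gradient descent.

    Assumed (hypothetical) objective, not taken from the paper:
        sum over controlled (t, j) of ||x[t][j] - target[t][j]||^2
        + lam * sum over all (t, j) of ||x[t][j] - coarse[t][j]||^2

    coarse, target: nested lists indexed [frame][joint][coordinate]
    mask:           nested lists indexed [frame][joint]; True where the
                    user specified a trajectory point for that joint
    """
    # Work on a copy so the coarse prediction is left untouched.
    x = [[list(joint) for joint in frame] for frame in coarse]
    for _ in range(steps):
        for t, frame in enumerate(x):
            for j, joint in enumerate(frame):
                for d in range(len(joint)):
                    # Gradient of the proximity term.
                    grad = 2.0 * lam * (joint[d] - coarse[t][j][d])
                    # Gradient of the trajectory term, controlled joints only.
                    if mask[t][j]:
                        grad += 2.0 * (joint[d] - target[t][j][d])
                    joint[d] -= lr * grad
    return x


# Toy example: 2 frames, 2 joints, 3D positions; only joint 0 is controlled.
coarse = [[[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]] for _ in range(2)]
target = [[[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]] for _ in range(2)]
mask = [[True, False] for _ in range(2)]
refined = refine_trajectories(coarse, target, mask)
```

In this toy setting the controlled joint moves most of the way toward its target while the uncontrolled joint stays at its coarse prediction. A smaller `lam` pulls the controlled joint closer to the target, at the cost of drifting further from the coarse prediction.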