Controllable human motion synthesis is essential for applications in AR/VR, gaming, film, and embodied AI. Existing methods typically offer either language control or full-trajectory control alone, and they lack precision when synthesizing motions that must follow user-specified trajectories, especially when multiple joints are controlled. To address these issues, we present TLControl, a new method for realistic human motion synthesis that combines low-level trajectory control with high-level language semantic control. Specifically, we first train a VQ-VAE to learn a compact latent motion space organized by body parts. We then propose a Masked Trajectories Transformer that, operating in the learned latent motion space, makes coarse initial predictions of the full joint trajectories, conditioned on user-specified partial trajectories and text descriptions. Finally, we introduce an efficient test-time optimization that refines these coarse predictions for accurate trajectory control. Experiments demonstrate that TLControl outperforms state-of-the-art methods in trajectory accuracy and time efficiency, making it practical for interactive, high-quality animation generation.
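To make the test-time refinement step concrete, the following is a minimal toy sketch of the general idea: start from a coarse latent estimate and run gradient descent so that the decoded trajectory matches the user-specified entries, leaving unspecified entries free. Everything here is an assumption for illustration: the fixed linear `decode` function stands in for a learned decoder, and the names `refine_latent` and `masked_error`, the squared-error objective, and the step size are hypothetical rather than taken from TLControl.

```python
# Toy illustration only: the linear "decoder" and all names below are
# hypothetical stand-ins, not TLControl's actual networks or objective.

def decode(z, weights):
    """Map a latent vector z to a flat trajectory with a fixed linear decoder."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in weights]

def masked_error(z, weights, target, mask):
    """Sum of squared errors over the user-controlled (mask == 1) entries."""
    pred = decode(z, weights)
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))

def refine_latent(z_init, weights, target, mask, lr=0.1, steps=500):
    """Gradient descent on z so decoded values match target where mask is 1."""
    z = list(z_init)
    for _ in range(steps):
        pred = decode(z, weights)
        # Residual only on controlled entries; uncontrolled ones are unconstrained.
        resid = [m * (p - t) for p, t, m in zip(pred, target, mask)]
        # Gradient of 0.5 * sum(resid^2) with respect to z is W^T resid.
        grad = [sum(weights[i][j] * resid[i] for i in range(len(weights)))
                for j in range(len(z))]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z

# Hypothetical decoder: 4 trajectory values produced from a 2-D latent.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
target = [0.5, -0.2, 0.0, 0.0]  # only the first two entries are user-specified
mask = [1, 1, 0, 0]
z0 = [0.0, 0.0]                 # stands in for the coarse initial prediction
z_star = refine_latent(z0, W, target, mask)
```

In this toy setting the masked error shrinks geometrically because the controlled rows of `W` form an identity matrix. A real system would backpropagate through a learned decoder and would likely use an optimizer and loss terms that the abstract does not specify.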