Controllable human motion synthesis is essential for applications in AR/VR, gaming, and embodied AI. Existing methods often focus solely on either language or full-trajectory control, lacking precision when synthesizing motions aligned with user-specified trajectories, especially under multi-joint control. To address these issues, we present TLControl, a novel method for realistic human motion synthesis that incorporates both low-level Trajectory control and high-level Language semantics by integrating neural-based and optimization-based techniques. Specifically, we begin by training a VQ-VAE to obtain a compact, well-structured latent motion space organized by body parts. We then propose a Masked Trajectories Transformer (MTT) that predicts a motion distribution conditioned on language and trajectory. Once trained, MTT samples initial motion predictions conditioned on user-specified partial trajectories and text descriptions. Finally, we introduce a test-time optimization that refines these coarse predictions for precise trajectory control; it offers flexibility by allowing users to specify various optimization goals and ensures high runtime efficiency. Comprehensive experiments show that TLControl significantly outperforms the state of the art in trajectory accuracy and time efficiency, making it practical for interactive, high-quality animation generation.
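The test-time refinement described above can be sketched as a small optimization loop. This is a toy illustration under assumed conditions, not the paper's implementation: a stand-in `decode` function plays the role of the VQ-VAE decoder, and gradient descent adjusts the latent codes so the decoded trajectory matches user-specified waypoints while staying close to the network's initial (MTT-sampled) prediction.

```python
# Toy sketch of test-time trajectory refinement (illustrative only; the real
# method optimizes latents of a body-part VQ-VAE with a learned decoder).

def decode(z):
    # Stand-in decoder: maps each latent value to a joint position.
    # The factor 2.0 is an arbitrary toy choice standing in for the network.
    return [2.0 * v for v in z]

def refine(z_init, targets, steps=500, lr=0.05, reg=0.1):
    """Minimize ||decode(z) - targets||^2 + reg * ||z - z_init||^2
    by plain gradient descent, refining a coarse latent prediction
    toward user-specified trajectory waypoints."""
    z = list(z_init)
    for _ in range(steps):
        out = decode(z)
        for i in range(len(z)):
            # Gradient w.r.t. z[i]: chain rule through the toy decoder
            # (Jacobian 2.0) plus the proximity regularizer.
            grad = 2.0 * (out[i] - targets[i]) * 2.0 + 2.0 * reg * (z[i] - z_init[i])
            z[i] -= lr * grad
    return z

z0 = [0.5, -0.3, 1.0]        # coarse latent prediction (e.g., sampled from the model)
waypoints = [1.0, 0.0, 2.5]  # user-specified trajectory targets
z_star = refine(z0, waypoints)
refined = [round(p, 3) for p in decode(z_star)]
print(refined)  # decoded trajectory lands close to the waypoints
```

Because the objective is just a quadratic fit-plus-proximity term, users can swap in different goals (e.g., weighting only the joints they constrained), which is the flexibility the abstract refers to.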