We introduce MoLingo, a text-to-motion (T2M) model that generates realistic human motion by denoising in a continuous latent space. Recent work performs diffusion in latent space, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so that diffusion becomes more effective, and (2) how best to inject text conditioning so that the generated motion follows the description closely. We propose a semantically aligned motion encoder trained with frame-level text labels, so that latents with similar text meaning stay close together, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention yields better motion realism and text-motion alignment. Combining semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation, both on standard metrics and in a user study. We will release our code and models for further research and downstream use.
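To make the difference between the two conditioning schemes concrete, the following is a minimal NumPy sketch, not the MoLingo implementation. The mean pooling, the additive residual update, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are illustrative assumptions that the abstract does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_token_condition(latents, text_tokens):
    # Assumption: the text is pooled (here by averaging) into one vector,
    # which is added to every motion latent identically.
    pooled = text_tokens.mean(axis=0)            # (d,)
    return latents + pooled                      # (n, d)

def cross_attention_condition(latents, text_tokens, w_q, w_k, w_v):
    # Each motion latent forms a query and attends over all text tokens,
    # so different latents can draw on different words.
    q = latents @ w_q                            # (n, d)
    k = text_tokens @ w_k                        # (m, d)
    v = text_tokens @ w_v                        # (m, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n, m)
    return latents + weights @ v, weights

# Hypothetical sizes: 4 motion latents, 6 text tokens, width 8.
rng = np.random.default_rng(0)
n, m, d = 4, 6, 8
latents = rng.normal(size=(n, d))
text_tokens = rng.normal(size=(m, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

single_out = single_token_condition(latents, text_tokens)
cross_out, attn = cross_attention_condition(latents, text_tokens, w_q, w_k, w_v)
```

In this sketch the single-token path shifts every latent by the same vector, whereas the cross-attention path produces a separate weighting over the text tokens for each latent, which is one plausible reason a multi-token scheme could track the description more closely.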