PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, and autoregressive approaches suffer from severe error accumulation over extended rollouts. We present PRISM, which addresses each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time × joints) that is compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space, made without modifying the generator, substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art results on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
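To make the noise-free condition injection concrete, the following is a minimal sketch and not the PRISM implementation. It builds a (time × joints) latent grid in which conditioning frames stay clean at timestep 0 while all other tokens are noised at a shared timestep t. The function name `build_denoiser_input`, the tensor shapes, and the linear data-to-noise interpolation are all assumptions made for illustration; the abstract does not specify the noise schedule.

```python
import numpy as np


def build_denoiser_input(clean_latents, cond_mask, t, rng):
    """Hypothetical helper: assemble a partially noised latent grid.

    clean_latents: (T, J, D) array, one D-dim token per frame and joint.
    cond_mask:     (T,) bool array, True for conditioning frames.
    t:             noise level in [0, 1] for the non-conditioning tokens.
    rng:           numpy random Generator.
    Returns (noisy_latents, timesteps) with shapes (T, J, D) and (T, J).
    """
    T, J, _ = clean_latents.shape
    # Per-token timestep grid: 0 on conditioning frames, t everywhere else.
    timesteps = np.where(cond_mask[:, None], 0.0, np.full((T, J), t))
    noise = rng.standard_normal(clean_latents.shape)
    # ASSUMPTION: linear interpolation between data and noise; the paper
    # may use a different schedule.
    w = timesteps[..., None]
    noisy_latents = (1.0 - w) * clean_latents + w * noise
    return noisy_latents, timesteps


rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 22, 4))  # 8 frames, 22 joints (assumed sizes)
mask = np.zeros(8, dtype=bool)
mask[:2] = True                            # first two frames are the condition
noisy, ts = build_denoiser_input(latents, mask, t=0.7, rng=rng)
```

Under this sketch, autoregressive chaining would amount to marking the last frames of one generated segment as conditioning frames for the next, so a single denoiser could serve both text-only and pose-conditioned generation.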
