Transformer-based text-to-motion generation has recently made impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient, real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on the HumanML3D, KIT-ML, and CMP benchmarks show that MOGO matches or surpasses state-of-the-art transformer-based methods in generation quality while offering substantial improvements in real-time performance, streaming generation, and zero-shot generalization.
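To make the MoSA-VQ idea concrete, the following is a minimal sketch of residual vector quantization with a learnable per-level scale. It is not the paper's implementation: the class name `ResidualVQ`, the codebook sizes, and the choice of a scalar gain per level (`self.scales`) are illustrative assumptions; only the general residual-quantization-with-scaling scheme reflects the abstract.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Sketch: residual vector quantizer with a learnable scale per level."""

    def __init__(self, num_levels: int = 6, codebook_size: int = 512, dim: int = 512):
        super().__init__()
        # One codebook per residual level (hypothetical sizes).
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_levels)
        )
        # Learnable scalar gain per level, normalizing residual magnitudes
        # before lookup (an assumption about how "learnable scaling" enters).
        self.scales = nn.Parameter(torch.ones(num_levels))

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, dim) continuous motion features from an encoder.
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for level, codebook in enumerate(self.codebooks):
            scaled = residual / self.scales[level]
            # Nearest codebook entry for each frame at this level.
            dists = torch.cdist(scaled, codebook.weight.unsqueeze(0))  # (B, T, K)
            idx = dists.argmin(dim=-1)                                 # (B, T)
            q = codebook(idx) * self.scales[level]
            quantized = quantized + q
            residual = residual - q  # next level quantizes what remains
            codes.append(idx)
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = x + (quantized - x).detach()
        # codes: (num_levels, B, T) hierarchical token grid for the transformer.
        return quantized, torch.stack(codes)
```

Each level quantizes the residual left over by the levels above it, so coarse levels capture gross pose and finer levels refine details; the stacked code grid is what the autoregressive model consumes.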
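Similarly, the sketch below illustrates how a causal decoder can emit all residual levels in a single forward pass, in the spirit of RQHC-Transformer. The class `OnePassDecoder`, the summed per-level embeddings, and the per-level output heads are assumptions for illustration, not the paper's architecture; the point shown is that one pass produces logits for every quantization level at every position, avoiding one decoding pass per level.

```python
import torch
import torch.nn as nn

class OnePassDecoder(nn.Module):
    """Sketch: causal decoder predicting every residual level in one pass."""

    def __init__(self, num_levels: int = 6, codebook_size: int = 512,
                 dim: int = 512, depth: int = 8, num_heads: int = 8):
        super().__init__()
        # Per-level token embeddings, summed into one input stream (assumption).
        self.level_emb = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_levels)
        )
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True,
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # One classification head per level: all levels decoded together.
        self.heads = nn.ModuleList(
            nn.Linear(dim, codebook_size) for _ in range(num_levels)
        )

    def forward(self, codes: torch.Tensor, text_emb: torch.Tensor):
        # codes: (num_levels, B, T) motion tokens from the quantizer;
        # text_emb: (B, 1, dim) pooled text features used as a conditioning prefix.
        tok = sum(emb(codes[i]) for i, emb in enumerate(self.level_emb))
        h = torch.cat([text_emb, tok], dim=1)  # prepend the text condition
        T = h.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
        h = self.backbone(h, mask=causal)
        # A single forward pass yields logits for all levels at all positions.
        return [head(h) for head in self.heads]
```

Because every level is predicted from the same backbone activations, inference cost scales with sequence length rather than with sequence length times the number of quantization levels, which is the latency saving the abstract claims.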