With the rapid progress of large language models (LLMs), multimodal frameworks that unify understanding and generation have shown promise, yet they face increasing complexity as the number of modalities and tasks grows. We observe that motion quantization introduces approximation errors that cap motion quality, and that unifying discrete text and continuous motion within a single-stream backbone amplifies cross-modal interference. Motivated by recent multi-branch Transformer designs that separate signals from different modalities, we propose MotionGPT3, a bimodal motion-language model for both understanding and generation. MotionGPT3 encodes raw motion into a continuous latent space using a variational autoencoder (VAE), thereby avoiding quantization-induced artifacts, while leveraging the semantic prior of pretrained language models. A dual-stream Transformer with shared attention preserves modality-specific routes while enabling controlled, bidirectional information flow, which reduces interference, stabilizes optimization, and empirically accelerates convergence without degrading fidelity. For multimodal joint training, a generate-then-align three-stage schedule further improves stability and limits cross-task interference. Experiments show that MotionGPT3 achieves 2x faster convergence in training loss and up to 4x faster convergence in validation, while maintaining state-of-the-art performance on standard motion understanding and motion generation benchmarks.
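The dual-stream design with shared attention can be illustrated with a minimal sketch: each modality keeps its own query/key/value projections (the modality-specific routes), while attention scores are computed over the joint token sequence so that text and motion tokens can attend to each other (the bidirectional information flow). All names, dimensions, and projection choices below are hypothetical, for illustration only; this is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8                      # hidden size (illustrative)
n_text, n_motion = 5, 3    # token counts per modality (illustrative)

# One sequence of discrete-text embeddings, one of continuous motion latents
text = rng.normal(size=(n_text, d))
motion = rng.normal(size=(n_motion, d))

def proj():
    # Random linear projection standing in for a learned weight matrix
    return rng.normal(size=(d, d)) / np.sqrt(d)

# Separate Q/K/V projections per stream: the modality-specific routes
Wq_t, Wk_t, Wv_t = proj(), proj(), proj()
Wq_m, Wk_m, Wv_m = proj(), proj(), proj()

# Shared attention: queries, keys, and values from both streams are
# concatenated, so every token attends over the joint sequence
q = np.vstack([text @ Wq_t, motion @ Wq_m])
k = np.vstack([text @ Wk_t, motion @ Wk_m])
v = np.vstack([text @ Wv_t, motion @ Wv_m])

attn = softmax(q @ k.T / np.sqrt(d))   # (8, 8) joint attention map
out = attn @ v

# Outputs are split back so each stream continues through its own branch
text_out, motion_out = out[:n_text], out[n_text:]
print(text_out.shape, motion_out.shape)
```

In a full model each branch would then apply its own feed-forward layers, so only the attention step mixes modalities, which is what keeps the routes separate while still allowing cross-modal exchange.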