Although pre-trained large language models continue to advance, building a unified model for language and other multi-modal data, such as motion, remains challenging and largely unexplored. Fortunately, human motion displays a semantic coupling akin to human language and is often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model that handles multiple motion-relevant tasks. Specifically, we apply discrete vector quantization to human motion and convert 3D motion into motion tokens, similar to the way word tokens are generated. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks, including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
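To make the "motion vocabulary" idea concrete, the following is a minimal sketch of nearest-neighbour vector quantization: each latent motion vector is replaced by the index of its closest codebook entry, and the indices are rendered as tokens that can sit next to ordinary words in one sequence. The toy codebook, the latent values, the `<motion_id_k>` token naming, and all function names are assumptions made for illustration; they are not taken from the abstract, and a real system would learn the encoder and codebook rather than hard-code them.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def to_motion_tokens(indices: Sequence[int]) -> List[str]:
    """Render codebook indices as token strings (hypothetical naming scheme)."""
    return [f"<motion_id_{k}>" for k in indices]


def build_sequence(instruction: str, motion_tokens: Sequence[str]) -> str:
    """Place text and motion tokens in one sequence for a language model."""
    return instruction + " " + " ".join(motion_tokens)


# Toy codebook with four 2-D entries (assumed values, not learned).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Toy latent vectors, standing in for the output of a motion encoder.
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.9]]

indices = quantize(latents, codebook)
tokens = to_motion_tokens(indices)
sequence = build_sequence("Describe the motion:", tokens)

print(indices)   # [0, 3, 2]
print(sequence)  # Describe the motion: <motion_id_0> <motion_id_3> <motion_id_2>
```

Once motion is expressed as discrete tokens like these, the same next-token objective used for text can in principle be applied to mixed motion-text sequences, which is the sense in which motion is treated as a language.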