Recent advances in Multimodal Large Language Models (MM-LLMs) have shown promising generalization and robustness across different modalities. While previous works have already achieved 3D human motion generation using various approaches, including language modeling, they mostly use specialized architectures and are restricted to single-human motion generation. Inspired by the success of MM-LLMs, we propose MotionLLM, a simple and general framework that supports single-human motion generation, multi-human motion generation, and motion captioning by fine-tuning pre-trained LLMs. Specifically, we encode and quantize motions into discrete tokens that an LLM can process, which results in a unified vocabulary consisting of both motion and text tokens. With only 1–3% of the LLM's parameters trained using adapters, our single-human motion generation achieves results comparable to diffusion models and other transformer-based models trained from scratch. Additionally, we show that our approach is scalable and flexible, allowing easy extension to multi-human motion generation through autoregressive generation of single-human motions. Project page: https://knoxzhao.github.io/MotionLLM
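To make the unified-vocabulary idea concrete, the following is a minimal sketch of how quantized motion could share one ID space with text tokens. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the quantizer, so nearest-neighbour codebook lookup is used here as one common choice; the learned motion encoder is skipped, and frames are treated as already-encoded latent vectors. The names `nearest_code`, `motion_to_token_ids`, and `is_motion_token`, the toy codebook, and the text vocabulary size of 32000 are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(frame: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook vector closest to `frame` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def motion_to_token_ids(
    frames: Sequence[Vector],
    codebook: Sequence[Vector],
    text_vocab_size: int,
) -> List[int]:
    """Quantize each frame and shift its code index past the text vocabulary.

    Assumption: motion tokens are appended after the text tokens, so motion
    code k maps to token ID text_vocab_size + k in the unified vocabulary.
    """
    return [text_vocab_size + nearest_code(frame, codebook) for frame in frames]


def is_motion_token(token_id: int, text_vocab_size: int) -> bool:
    """A token ID at or beyond the text vocabulary size denotes a motion code."""
    return token_id >= text_vocab_size


if __name__ == "__main__":
    TEXT_VOCAB_SIZE = 32000  # hypothetical size of the pre-trained text vocabulary
    toy_codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # hypothetical 3-entry codebook
    toy_frames = [(0.9, 0.1), (0.1, 0.8), (0.0, 0.1)]  # stand-ins for encoded motion
    print(motion_to_token_ids(toy_frames, toy_codebook, TEXT_VOCAB_SIZE))
```

With an offset scheme like this, a single sequence can interleave text and motion IDs, and a decoder can separate them again with a simple threshold check such as `is_motion_token`.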