This paper presents M$^3$GPT, an advanced $\textbf{M}$ultimodal, $\textbf{M}$ultitask framework for $\textbf{M}$otion comprehension and generation. M$^3$GPT operates on three fundamental principles. The first is creating a unified representation space for the various motion-relevant modalities. We employ discrete vector quantization for multimodal control and generation signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second is modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with discrete tokenization, resulting in more detailed and comprehensive motion generation. The third is learning to model the connections and synergies among the various motion-relevant tasks. Text, the modality most familiar to and best understood by LLMs, is utilized as a bridge that connects the different motion tasks, facilitating mutual reinforcement among them. To our knowledge, M$^3$GPT is the first model capable of comprehending and generating motions conditioned on multiple signals. Extensive experiments highlight M$^3$GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities on extremely challenging tasks.
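The first principle above, mapping continuous signals to discrete codes that share one LLM vocabulary with text, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the codebook lookup is plain nearest-neighbor quantization, and the vocabulary sizes (`TEXT_VOCAB`, `MOTION_CODES`) and the offset scheme are hypothetical placeholders.

```python
import numpy as np

def quantize(features, codebook):
    """Map continuous feature vectors to the index of the nearest codebook entry."""
    # features: (T, D) sequence of motion/music features; codebook: (K, D)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return dists.argmin(axis=1)  # (T,) discrete token ids

# Hypothetical vocabulary layout: text tokens first, then motion codes, then music codes.
TEXT_VOCAB = 32000   # assumed size of the LLM's text vocabulary
MOTION_CODES = 512   # assumed motion codebook size

def to_unified_ids(motion_tokens, music_tokens):
    """Offset modality-specific code indices into one shared LLM vocabulary."""
    motion_ids = TEXT_VOCAB + motion_tokens
    music_ids = TEXT_VOCAB + MOTION_CODES + music_tokens
    return motion_ids, music_ids
```

After this offsetting, token ids from every modality occupy disjoint ranges of a single vocabulary, so the LLM can consume and emit them exactly as it does text tokens.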