Versatile Motion Language Models for Multi-Turn Interactive Agents

Recent advancements in large language models (LLMs) have greatly enhanced their ability to generate natural and contextually relevant text, making AI interactions more human-like. However, generating and understanding interactive human-like motion, where two individuals engage in coordinated movements, remains a challenge because such coordinated interactions are complex to model. Furthermore, a versatile model is required to handle diverse interactive scenarios, such as chat systems that follow user instructions or adapt to an assigned role while adjusting interaction dynamics. To tackle this problem, we introduce VIM, short for the Versatile Interactive Motion language model, which integrates the language and motion modalities to understand, generate, and control interactive motions in multi-turn conversational contexts. To address the scarcity of multi-turn interactive motion data, we introduce a synthetic dataset, INTER-MT2, for which we use pre-trained models to create diverse instructional data with interactive motion. Our approach first trains a motion tokenizer that encodes interactive motions into residual discrete tokens. In the pretraining stage, the model learns to align motion and text representations with these discrete tokens. During the instruction fine-tuning stage, VIM adapts to multi-turn conversations using the INTER-MT2 dataset. We evaluate the versatility of our method across motion-related tasks: motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. The results highlight the versatility and effectiveness of the proposed method in handling complex interactive motion synthesis.
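The abstract does not specify how the motion tokenizer produces its residual discrete tokens. As a rough illustration only, the sketch below shows generic residual vector quantization: each level picks the nearest codebook entry for whatever the previous levels left unexplained, and the list of chosen indices is the token sequence. All names (`make_codebooks`, `rvq_encode`, `rvq_decode`), the random codebooks, and the zero entry added to each codebook are assumptions made for this example, not details of VIM; a real tokenizer would presumably learn the codebooks and quantize latent vectors from a motion encoder.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen code so the next level sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected code from each level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(levels=3, size=8, dim=4, seed=0):
    """Hypothetical random codebooks (stand-ins for learned ones).

    Each codebook starts with a zero vector (an assumption of this sketch),
    so adding a level can never increase the reconstruction error.
    """
    rng = random.Random(seed)
    books = []
    for level in range(levels):
        scale = 1.0 / (2 ** level)  # finer codes at deeper levels
        codes = [[0.0] * dim]
        codes += [[rng.uniform(-scale, scale) for _ in range(dim)]
                  for _ in range(size)]
        books.append(codes)
    return books


if __name__ == "__main__":
    books = make_codebooks()
    latent = [0.3, -0.2, 0.5, 0.1]  # stand-in for one encoded motion frame
    tokens = rvq_encode(latent, books)
    print("tokens:", tokens)
    print("reconstruction:", rvq_decode(tokens, books))
```

Because encoding is greedy, the tokens from the first k levels are a prefix of the full token list, which is one reason residual tokens are convenient for a language model: coarse levels carry most of the signal and later levels refine it.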
